Preface


This book provides a detailed treatment of microeconometric analysis, the analysis of 
individual-level data on the economic behavior of individuals or firms. This type of 
analysis usually entails applying regression methods to cross-section and panel data. 

The book aims at providing the practitioner with a comprehensive coverage of sta- 
tistical methods and their application in modern applied microeconometrics research. 
These methods include nonlinear modeling, inference under minimal distributional 
assumptions, identifying and measuring causation rather than mere association, and 
correcting departures from simple random sampling. Many of these features are of 
relevance to individual-level data analysis throughout the social sciences. 

The ambitious agenda has determined the characteristics of this book. First, al- 
though oriented to the practitioner, the book is relatively advanced in places. A cook- 
book approach is inadequate because when two or more complications occur simulta- 
neously — a common situation — the practitioner must know enough to be able to adapt 
available methods. Second, the book provides considerable coverage of practical data 
problems (see especially the last three chapters). Third, the book includes substantial 
empirical examples in many chapters to illustrate some of the methods covered. Fi- 
nally, the book is unusually long. Despite this length we have been space-constrained. 
We had intended to include even more empirical examples, and abbreviated presen- 
tations will at times fail to recognize the accomplishments of researchers who have 
made substantive contributions. 

The book assumes a good understanding of the linear regression model with matrix 
algebra. It is written at the mathematical level of the first-year economics Ph.D. se- 
quence, comparable to Greene (2003). We have two types of readers in mind. First, the 
book can be used as a course text for a microeconometrics course, typically taught in 
the second year of the Ph.D., or for data-oriented microeconomics field courses such 
as labor economics, public economics, and industrial organization. Second, the book 
can be used as a reference work for graduate students and applied researchers who 
despite training in microeconometrics will inevitably have gaps that they wish to fill. 

For instructors using this book as an econometrics course text it is best to introduce 
the basic nonlinear cross-section and linear panel data models as early as possible,
initially skipping many of the methods chapters. The key methods chapter (Chapter 5)
covers maximum-likelihood and nonlinear least-squares estimation. Knowledge of 
maximum likelihood and nonlinear least-squares estimators provides adequate
background for the most commonly used nonlinear cross-section models (Chapters 14-17
and 20), basic linear panel data models (Chapter 21), and treatment evaluation meth- 
ods (Chapter 25). Generalized method of moments estimation (Chapter 6) is needed 
especially for advanced linear panel data methods (Chapter 22). 

For readers using this book as a reference work, the chapters have been written to be 
as self-contained as possible. The notable exception is that some command of general 
estimation results in Chapter 5, and occasionally Chapter 6, will be necessary. Most 
chapters on models are structured to begin with a discussion and example that is acces- 
sible to a wide audience. 
PART ONE


Preliminaries 


Part 1 covers the essential components of microeconometric analysis — an economic 
specification, a statistical model and a data set. 

Chapter 1 discusses the distinctive aspects of microeconometrics, and provides an 
outline of the book. It emphasizes that discreteness of data, and nonlinearity and het- 
erogeneity of behavioral relationships are key aspects of individual-level microecono- 
metric models. It concludes by presenting the notation and conventions used through- 
out the book. 

Chapters 2 and 3 set the scene for the remainder of the book by introducing the 
reader to key model and data concepts that shape the analyses of later chapters. 

A key distinction in econometrics is between essentially descriptive models and 
data summaries at various levels of statistical sophistication and models that go be- 
yond associations and attempt to estimate causal parameters. The classic definitions 
of causality in econometrics derive from the Cowles Commission simultaneous equa- 
tions models that draw sharp distinctions between exogenous and endogenous vari- 
ables, and between structural and reduced form parameters. Although reduced form 
models are very useful for some purposes, knowledge of structural or causal parame- 
ters is essential for policy analyses. Identification of structural parameters within the 
simultaneous equations framework poses numerous conceptual and practical
difficulties. An increasingly used alternative approach, based on the potential outcome
model, also attempts to identify causal parameters, but it does so by posing limited questions
within a more manageable framework. Chapter 2 attempts to provide an overview of 
the fundamental issues that arise in these and other alternative frameworks. Readers 
who initially find this material challenging should return to this chapter after gaining 
greater familiarity with specific models covered later in the book. 

The empirical researcher’s ability to identify causal parameters depends not only 
on the statistical tools and models but also on the type of data available. An experi- 
mental framework provides a standard for establishing causal connections. However, 
observational, not experimental, data form the basis of much of econometric inference. 
Chapter 3 surveys the pros and cons of three main types of data: observational data, 
data from social experiments, and data from natural experiments. The strengths and 
weaknesses of conducting causal inference based on each type of data are reviewed. 
CHAPTER 1


Overview 


1.1. Introduction 


This book provides a detailed treatment of microeconometric analysis, the analysis 
of individual-level data on the economic behavior of individuals or firms. A broader 
definition would also include grouped data. Usually regression methods are applied to 
cross-section or panel data. 

Analysis of individual data has a long history. Ernst Engel (1857) was among the 
earliest quantitative investigators of household budgets. Allen and Bowley (1935), 
Houthakker (1957), and Prais and Houthakker (1955) made important contributions 
following the same research and modeling tradition. Other landmark studies that were 
also influential in stimulating the development of microeconometrics, even though 
they did not always use individual-level information, include those by Marschak and 
Andrews (1944) in production theory and by Wold and Jureen (1953), Stone (1953), 
and Tobin (1958) in consumer demand. 

Important as the earlier work cited above on household budgets and demand analysis
is, the material covered in this book has stronger connections with the work on
discrete choice analysis and censored and truncated variable models that saw their first 
serious econometric applications in the work of McFadden (1973, 1984) and Heckman 
(1974, 1979), respectively. These works involved a major departure from the over- 
whelming reliance on linear models that characterized earlier work. Subsequently, they 
have led to significant methodological innovations in econometrics. Among the earlier 
textbook-level treatments of this material (and more) are the works of Maddala (1983) 
and Amemiya (1985). As emphasized by Heckman (2001), McFadden (2001), and oth- 
ers, many of the fundamental issues that dominated earlier work based on market data 
remain important, especially concerning the conditions necessary for identifiability of 
causal economic relations. Nonetheless, the style of microeconometrics is sufficiently 
distinct to justify writing a text that is exclusively devoted to it. 

Modern microeconometrics based on individual-, household-, and establishment- 
level data owes a great deal to the greater availability of data from cross-section 
and longitudinal sample surveys and census data. In the past two decades, with the 
expansion of electronic recording and collection of data at the individual level, data
volume has grown explosively. So too has the available computing power for analyzing 
large and complex data sets. In many cases event-level data are available; for example, 
marketing science often deals with purchase data collected by electronic scanners in 
supermarkets, and industrial organization literature contains econometric analyses of 
airline travel data collected by online booking systems. There are now new branches of 
economics, such as social experimentation and experimental economics, that generate 
“experimental” data. These developments have created many new modeling opportu- 
nities that are absent when only aggregated market-level data are available. Meanwhile 
the explosive growth in the volume and types of data has also given rise to numerous 
methodological issues. Processing and econometric analysis of such large microdata- 
bases, with the objective of uncovering patterns of economic behavior, constitutes the 
core of microeconometrics. Econometric analysis of such data is the subject matter of 
this book. 

Key precursors of this book are the books by Maddala (1983) and Amemiya (1985). 
Like them it covers topics that are presented only briefly, or not at all, in undergraduate 
and first-year graduate econometrics courses. Especially compared to Amemiya (1985) 
this book is more oriented to the practitioner. The level of presentation is nonetheless 
advanced in places, especially for applied researchers in disciplines that are less math- 
ematically oriented than economics. 

A relatively advanced presentation is needed for several reasons. First, the data are 
often discrete or censored, in which case nonlinear methods such as logit, probit, 
and Tobit models are used. This leads to statistical inference based on more difficult 
asymptotic theory. 

Second, distributional assumptions for such data become critically important. One 
response is to develop highly parametric models that are sufficiently detailed to capture 
the complexities of data, but these models can be challenging to estimate. A more com- 
mon response is to minimize parametric assumptions and perform statistical inference 
based on standard errors that are “robust” to complications such as heteroskedasticity 
and clustering. In such cases considerable knowledge can be needed to ensure valid 
statistical inference even if a standard regression package is used. 

Third, economic studies often aim to determine causation rather than merely mea- 
sure correlation, despite access to observational rather than experimental data. This 
leads to methods to isolate causation such as instrumental variables, simultaneous 
equations, measurement error correction, selection bias correction, panel data fixed 
effects, and differences-in-differences. 

Fourth, microeconomic data are typically collected using cross-section and panel 
surveys, censuses, or social experiments. Survey data collected using these methods 
are subject to problems of complex survey methodology, departures from simple ran- 
dom sampling assumptions, and problems of sample selection, measurement errors, 
and incomplete and/or missing data. Dealing with such issues in a way that can
support valid population inferences from the estimated econometric models
requires use of advanced methods.

Finally, it is not unusual that two or more complications occur simultaneously, 
such as endogeneity in a logit model with panel data. Then a cookbook approach 
becomes very difficult to implement. Instead, considerable understanding of the
theory underlying the methods is needed, as the researcher may need to read econometrics
journal articles and adapt standard econometrics software. 


1.2. Distinctive Aspects of Microeconometrics 


We now consider several advantages of microeconometrics that derive from its distinc- 
tive features. 


1.2.1. Discreteness and Nonlinearity 


The first and most obvious point is that microeconometric data are usually at a low 
level of aggregation. This has a major consequence for the functional forms used to 
analyze the variables of interest. In many, if not most, cases linear functional forms 
turn out to be simply inappropriate. More fundamentally, disaggregation brings to the 
forefront heterogeneity of individuals, firms, and organizations that should be prop- 
erly controlled (modeled) if one wants to make valid inferences about the underlying 
relationships. We discuss these issues in greater detail in the following sections. 

Although aggregation is not entirely absent in microdata, as for example when 
household- or establishment-level data are collected, the level of aggregation is usu- 
ally orders of magnitude lower than is common in macro analyses. In the latter case the 
process of aggregation leads to smoothing, with many of the movements in opposite 
directions canceling in the course of summation. The aggregated variables often show 
smoother behavior than their components, and the relationships between the aggre- 
gates frequently show greater smoothness than the components. For example, a rela- 
tion between two variables at a micro level may be piecewise linear with many nodes. 
After aggregation the relationship is likely to be well approximated by a smooth func- 
tion. Hence an immediate consequence of disaggregation is the absence of features of 
continuity and smoothness both of the variables themselves and of the relationships 
between them. 
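
The smoothing effect of aggregation can be illustrated with a small simulation. The sketch below is only an illustration under assumed functional forms: each hypothetical individual responds according to y_i = max(0, x - t_i), a piecewise-linear relation with a single kink at a threshold t_i, and the thresholds are assumed to be spread evenly over the unit interval. The function names, the grid, and all numerical values are assumptions made for the example.

```python
# Assumed micro relation (illustrative only): y_i = max(0, x - t_i),
# a piecewise-linear response with one kink at the threshold t_i.
def micro_response(x, threshold):
    return max(0.0, x - threshold)

# Hypothetical population: 1,000 thresholds spread evenly over [0, 1).
thresholds = [i / 1000 for i in range(1000)]

def aggregate_response(x):
    """Population average of the kinked individual responses."""
    return sum(micro_response(x, t) for t in thresholds) / len(thresholds)

def one_unit(x):
    """Response of a single individual whose threshold is 0.5."""
    return micro_response(x, 0.5)

def slope(f, x, h=0.01):
    """Central finite-difference slope of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

grid = [0.3, 0.45, 0.55, 0.7]
micro_slopes = [slope(one_unit, x) for x in grid]
aggregate_slopes = [slope(aggregate_response, x) for x in grid]

for x, m, a in zip(grid, micro_slopes, aggregate_slopes):
    print(f"x = {x:.2f}  micro slope = {m:.2f}  aggregate slope = {a:.2f}")
```

Under these assumptions the slope of the single unit's response jumps from 0 to 1 at its threshold, whereas the slope of the population average rises gradually with x, roughly tracking the share of individuals whose threshold has been passed.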

Usually individual- and firm-level data cover a huge range of variation, both in the 
cross-section and time-series dimensions. For example, average weekly consumption 
of (say) beef is highly likely to be positive and smoothly varying, whereas that of an in- 
dividual household in a given week may be frequently zero and may also switch to pos- 
itive values from time to time. The average number of hours worked by female workers 
is unlikely to be zero, but many individual females have zero market hours of work 
(corner solutions), switching to positive values at other times in the course of their la- 
bor market history. Average household expenditure on vacations is usually positive, but 
many individual households may have zero expenditure on vacations in any given year. 
Average per capita consumption of tobacco products will usually be positive, but many 
individuals in the population have never consumed these products and never will, irre- 
spective of price and income considerations. As Pudney (1989) has observed, micro- 
data exhibit “holes, kinks and corners.” The holes correspond to nonparticipation in the 
activity of interest, kinks correspond to the switching behavior, and corners correspond 
to the incidence of nonconsumption or nonparticipation at specific points of time.
That is, discreteness and nonlinearity of response are intrinsic to microeconometrics. 
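
The contrast between a positive average and frequent individual zeros can be sketched with simulated data. The data-generating process below is an assumption chosen for illustration (a normally distributed desired expenditure censored at zero); it is not taken from the text, and the variable names are hypothetical.

```python
import random

random.seed(1)

# Assumed process (illustrative only): desired weekly spending y* is
# normal with mean 0.5 and standard deviation 1; observed spending is
# censored at zero, y = max(0, y*), which produces corner solutions.
N_HOUSEHOLDS = 10_000
desired = [random.gauss(0.5, 1.0) for _ in range(N_HOUSEHOLDS)]
observed = [max(0.0, d) for d in desired]

average_spending = sum(observed) / N_HOUSEHOLDS
share_at_corner = sum(1 for v in observed if v == 0.0) / N_HOUSEHOLDS

print(f"average spending:        {average_spending:.3f}")
print(f"share of zero purchases: {share_at_corner:.3f}")
```

In this sketch the average is comfortably positive even though a sizable share of households sits exactly at zero, which is the pattern a linear model fitted to the individual observations would fail to capture.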

An important class of nonlinear models in microeconometrics deals with limited 
dependent variables (Maddala, 1983). This class includes many models that provide 
suitable frameworks for analyzing discrete responses and responses with limited range 
of variation. Such tools of analyses are of course also available for analyzing macro- 
data, if required. The point is that they are indispensable in microeconometrics and 
give it its distinctive feature. 


1.2.2. Greater Realism 


Macroeconometrics is sometimes based on strong assumptions; the representative 
agent assumption is a leading example. A frequent appeal is made to microeconomic 
reasoning to justify certain specifications and interpretations of empirical results. How- 
ever, it is rarely possible to say explicitly how these are affected by aggregation over 
time and micro units. Alternatively, very extreme aggregation assumptions are made. 
For example, aggregates are said to reflect the behavior of a hypothetical representative 
agent. Such assumptions also are not credible. 

From the viewpoint of microeconomic theory, quantitative analysis founded on 
microdata may be regarded as more realistic than that based on aggregated data. There 
are three justifications for this claim. First, the measurement of the variables involved 
in such hypotheses is often more direct (though not necessarily free from measurement 
error) and has greater correspondence to the theory being tested. Second, hypotheses 
about economic behavior are usually developed from theories of individual behavior. If 
these hypotheses are tested using aggregated data, then many approximations and sim- 
plifying assumptions have to be made. The simplifying assumption of a representative 
agent causes a great loss of information and severely limits the scope of an empirical 
investigation. Because such assumptions can be avoided in microeconometrics, and 
usually are, in principle the microdata provide a more realistic framework for testing 
microeconomic hypotheses. This is not a claim that the promise of microdata is nec- 
essarily achieved in empirical work. Such a claim must be assessed on a case-by-case 
basis. Finally, a realistic portrayal of economic activity should accommodate a broad 
range of outcomes and responses that are a consequence of individual heterogeneity 
and that are predicted by underlying theory. In this sense microeconomic data sets can 
support more realistic models. 

Microeconometric data are often derived from household or firm surveys, typically 
encompassing a wide range of behavior, with many of the behavioral outcomes tak- 
ing the form of discrete or categorical responses. Such data sets have many awkward 
features that call for special tools in the formulation and analysis that, although not 
entirely absent from macroeconometric work, nevertheless are less widely used. 


1.2.3. Greater Information Content 


The potential advantages of microdata sets can be realized if such data are informa- 
tive. Because sample surveys often provide independent observations on thousands of 
cross-sectional units, such data are thought to be more informative than the standard,
usually highly serially correlated, macro time series typically consisting of at most a 
few hundred observations. 

As will be explained in the next chapter, in practice the situation is not so clear-cut 
because the microdata may be quite noisy. At the individual level many (idiosyncratic) 
factors may play a large role in determining responses. Often these cannot be observed, 
leading one to treat them under the heading of a random component, which can be a 
very large part of observed variation. In this sense randomness plays a larger role in 
microdata than in macrodata. Of course, this affects measures of goodness of fit of the 
regressions. Students whose initial exposure to econometrics comes through aggregate 
time-series analysis are often conditioned to see large $R^2$ values. When encountering
cross-section regressions for the first time, they express disappointment or even alarm 
at the “low explanatory power” of the regression equation. Nevertheless, there remains 
a strong presumption that, at least in certain dimensions, large microdata sets are highly 
informative. 
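
The effect of idiosyncratic noise on goodness of fit can be sketched as follows. The example assumes a hypothetical micro-level relation y = x + noise with a large noise variance, and it uses cell averages as a stand-in for aggregated data; both choices are assumptions made for illustration rather than anything specified in the text.

```python
import random

random.seed(2)

def r_squared(x, y):
    """R-squared of an OLS regression of y on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Assumed micro-level process: slope of 1 plus large idiosyncratic noise.
N = 10_000
x = [random.uniform(0, 10) for _ in range(N)]
y = [xi + random.gauss(0, 10) for xi in x]

# Aggregate: sort by x and average within 100 cells of 100 units each.
pairs = sorted(zip(x, y))
cells = [pairs[i:i + 100] for i in range(0, N, 100)]
x_bar = [sum(p[0] for p in c) / len(c) for c in cells]
y_bar = [sum(p[1] for p in c) / len(c) for c in cells]

r2_micro = r_squared(x, y)
r2_grouped = r_squared(x_bar, y_bar)
print(f"R-squared, individual data: {r2_micro:.2f}")
print(f"R-squared, cell averages:   {r2_grouped:.2f}")
```

The underlying slope is the same in both regressions; averaging simply removes most of the idiosyncratic variation, so a low R-squared in the individual-level regression does not by itself indicate a weaker relationship.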

Another qualification is that when one is dealing with purely cross-section data, 
very little can be said about the intertemporal aspects of relationships under study. 
This particular aspect of behavior can be studied using panel and transition data. 

In many cases one is interested in the behavioral responses of a specific group of 
economic agents under some specified economic environment. One example is the 
impact of unemployment insurance on the job search behavior of young unemployed 
persons. Another example is the labor supply responses of low-income individuals 
receiving income support. Unless microdata are used such issues cannot be addressed 
directly in empirical work. 


1.2.4. Microeconomic Foundations 


Econometric models vary in the explicit role given to economic theory. At one end of 
the spectrum there are models in which the a priori theorizing may play a dominant 
role in the specification of the model and in the choice of an estimation procedure. At 
the other end of the spectrum are empirical investigations that make much less use of 
economic theory. 

The goal of the analysis in the first case is to identify and estimate fundamental 
parameters, sometimes called deep parameters, that characterize individual taste and 
preferences and/or technological relationships. As a shorthand designation, we call 
this the structural approach. Its hallmark is a heavy dependence on economic theory 
and emphasis on causal inference. Such models may require many assumptions, such 
as the precise specification of a cost or production function or specification of the 
distribution of error terms. The empirical conclusions of such an exercise may not 
be robust with respect to the departures from the assumptions. In Section 2.4.4 we 
shall say more about this approach. At the present stage we simply emphasize that if 
the structural approach is implemented with aggregated data, it will yield estimates 
of the fundamental parameters only under very stringent (and possibly unrealistic) 
conditions. Microdata sets provide a more promising environment for the structural 
approach, essentially because they permit greater flexibility in model specification. 
The goal of the analysis in the second case is to model relationship(s) between
response variables of interest conditionally on variables the researcher takes as given, or
exogenous. More formal definitions of endogeneity and exogeneity are given in Chap- 
ter 2. As a shorthand designation, we call this a reduced form approach. The essential 
point is that reduced form analysis does not always take into account all causal inter- 
dependencies. A regression model in which the focus is on the prediction of y given 
regressors x, and not on the causal interpretation of the regression parameters, is often 
referred to as a reduced form regression. As will be explained in Chapter 2, in general 
the parameters of the reduced form model are functions of structural parameters. They 
may not be interpretable without some information about the structural parameters. 


1.2.5. Disaggregation and Heterogeneity 


It is sometimes said that many problems and issues of macroeconometrics arise from 
serial correlation of macro time series, and those of microeconometrics arise from 
heteroskedasticity of individual-level data. Although this is a useful characterization of 
the modeling effort in many microeconometric analyses, it needs amplification and is 
subject to important qualifications. In a range of microeconometric models, modeling 
of dynamic dependence may be an important issue. 

The benefits of disaggregation, which were emphasized earlier in this section, come 
at a cost: As the data become more disaggregated the importance of controlling for 
interindividual heterogeneity increases. Heterogeneity, or more precisely unobserved 
heterogeneity, plays a very important role in microeconometrics. Obviously, many 
variables that reflect interindividual heterogeneity, such as gender, race, educational 
background, and social and demographic factors, are directly observed and hence can 
be controlled for. In contrast, differences in individual motivation, ability, intelligence, 
and so forth are either not observed or, at best, imperfectly observed. 

The simplest response is to ignore such heterogeneity, that is, to absorb it into the 
regression disturbance. After all this is how one treats the myriad small unobserved 
factors. This step of course increases the unexplained part of the variation. More seri- 
ously, ignoring persistent interindividual differences leads to confounding with other 
factors that are also sources of persistent interindividual differences. Confounding is 
said to occur when the individual contributions of different regressors (predictor
variables) to the variation in the variable of interest cannot be statistically separated.
Suppose, for example, that the factor $x_1$ (schooling) is said to be the source of variation in
y (earnings), when another variable $x_2$ (ability), which is another source of variation,
does not appear in the model. Then that part of total variation that is attributable to 
the second variable may be incorrectly attributed to the first variable. Intuitively, their 
relative importances are confounded. A leading source of confounding bias is the in- 
correct omission of regressors from the model and the inclusion of other variables that 
are proxies for the omitted variable. 
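
The schooling and ability example corresponds to the standard omitted-variable result. As a sketch, assume (the linear specification and the symbols $\beta_1$, $\beta_2$, and $\widehat{b}_1$ are introduced here for illustration and are not in the text) that earnings are linear in schooling and ability with an error that is uncorrelated with both:

```latex
y = \beta_1 x_1 + \beta_2 x_2 + u,
\qquad
\operatorname{Cov}(x_1, u) = \operatorname{Cov}(x_2, u) = 0 .

% Regressing y on x_1 alone (with an intercept) gives a slope that converges to
\operatorname{plim}\, \widehat{b}_1
  = \frac{\operatorname{Cov}(x_1, y)}{\operatorname{Var}(x_1)}
  = \beta_1 + \beta_2 \, \frac{\operatorname{Cov}(x_1, x_2)}{\operatorname{Var}(x_1)} .
```

The second term is the part of the variation due to ability that is attributed to schooling. It vanishes only if $\beta_2 = 0$ or if schooling and ability are uncorrelated.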

Consider, for example, the case in which a program participation (0/1 dummy) 
variable D is included in the regression mean function with a vector of regressors x, 


$$y = \mathbf{x}'\boldsymbol{\beta} + \alpha D + u, \tag{1.1}$$
where u is an error term. The term “treatment” is used in biological and experimental
sciences to refer to an administered regimen involving participants in some trial. In 
econometrics it commonly refers to participation in some activity that may impact an 
outcome of interest. This activity may be randomly assigned to the participants or may 
be self-selected by the participant. Thus, although it is acknowledged that individuals 
choose their years of schooling, one still thinks of years of schooling as a “treatment” 
variable. Suppose that program participation is taken to be a discrete variable. The 
coefficient $\alpha$ of the “treatment variable” measures the average impact of the program
participation (D = 1), conditional on covariates. If one does not control for
unobserved heterogeneity, then a potential ambiguity affects the interpretation of the results.
If D is found to have a significant impact, then the following question arises: Is $\alpha$
significantly different from zero because D is correlated with some unobserved variable
that affects y or because there is a causal relationship between D and y? For example, 
if the program considered is university education, and the covariates do not include a 
measure of ability, giving a fully causal interpretation becomes questionable. Because 
the issue is important, more attention should be given to how to control for unobserved 
heterogeneity. 
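
The ambiguity can be made concrete with simulated data. The sketch below assumes a hypothetical version of (1.1) with no other covariates, in which an ability variable raises both the outcome and the probability of participation; the parameter values, the selection rule, and all names are assumptions made for illustration. Because the data are simulated, ability can be "observed" in the second regression, which is exactly what is unavailable in practice.

```python
import random

random.seed(0)

ALPHA, GAMMA, N = 1.0, 1.0, 20_000   # assumed true effects and sample size

# Ability affects both participation D and the outcome y.
ability = [random.gauss(0, 1) for _ in range(N)]
D = [1.0 if a + random.gauss(0, 1) > 0 else 0.0 for a in ability]
y = [ALPHA * d + GAMMA * a + random.gauss(0, 1) for d, a in zip(D, ability)]

def mean(v):
    return sum(v) / len(v)

def ols_slope(x, z):
    """Slope from an OLS regression of z on x with an intercept."""
    mx, mz = mean(x), mean(z)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxz = sum((xi - mx) * (zi - mz) for xi, zi in zip(x, z))
    return sxz / sxx

def residuals(x, z):
    """Residuals from an OLS regression of z on x with an intercept."""
    b = ols_slope(x, z)
    a0 = mean(z) - b * mean(x)
    return [zi - a0 - b * xi for xi, zi in zip(x, z)]

# Ability omitted: the coefficient on D absorbs part of the ability effect.
alpha_naive = ols_slope(D, y)

# Ability controlled for: partial it out of both D and y, then regress
# residual on residual (equivalent to including ability as a regressor).
alpha_controlled = ols_slope(residuals(ability, D), residuals(ability, y))

print(f"ability omitted:    {alpha_naive:.2f}")
print(f"ability controlled: {alpha_controlled:.2f}")
```

In this setup the estimate that omits ability is well above the assumed value of 1, while the estimate that controls for ability is close to it. A significant coefficient on D in the first regression therefore mixes the causal effect with the correlation between D and the unobserved variable.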

In some cases where dynamic considerations are involved the type of data available 
may put restrictions on how one can control for heterogeneity. Consider the example 
of two households, identical in all relevant respects except that one exhibits a sys- 
tematically higher preference for consuming good A. One could control for this by 
allowing individual utility functions to include a heterogeneity parameter that reflects 
their different preferences. Suppose now that there is a theory of consumer behavior 
that claims that consumers become addicted to good A, in the sense that the more they 
consume of it in one period, the greater is the probability that they will consume more 
of it in the future. This theory provides another explanation of persistent interindi- 
vidual differences in the consumption of good A. By controlling for heterogeneous 
preferences it becomes possible to test which source of persistence in consumption — 
preference heterogeneity or addiction — accounts for different consumption patterns. 
This type of problem arises whenever some dynamic element generates persistence 
in the observed outcomes. Several examples of this type of problem arise in various 
places in the book. 

A variety of approaches for modeling heterogeneity coexist in microeconometrics. 
A brief mention of some of these follows, with details postponed until later. 

An extreme solution is to ignore all unobserved interindividual differences. If unob- 
served heterogeneity is uncorrelated with observed heterogeneity, and if the outcome 
being studied has no intertemporal dependence, then the aforementioned problems will 
not arise. Of course, these are strong assumptions and even with these assumptions not 
all econometric difficulties disappear. 

One approach for handling heterogeneity is to treat it as a fixed effect and to esti- 
mate it as a coefficient of an individual specific 0/1 dummy variable. For example, in 
a cross-section regression, each micro unit is allowed its own dummy variable (inter- 
cept). This leads to an extreme proliferation of parameters because when a new individ- 
ual is added to the sample, a new intercept parameter is also added. Thus this approach 
will not work if our data are cross sectional. The availability of multiple observations 
per individual unit, most commonly in the form of panel data with T time-series
observations for each of the N cross-section units, makes it possible to either estimate
or eliminate the fixed effect, for example by first differencing if the model is linear 
and the fixed effect is additive. If the model is nonlinear, as is often the case, the fixed 
effect will usually not be additive and other approaches will need to be considered. 
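
Eliminating an additive fixed effect by first differencing can be sketched with a simulated two-period panel. The example assumes a hypothetical linear model y_it = beta * x_it + c_i + u_it in which the regressor is correlated with the individual effect c_i; the parameter values and names are assumptions made for illustration.

```python
import random

random.seed(3)

def ols_slope(x, y):
    """Slope from an OLS regression of y on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx

BETA, N = 2.0, 5_000                      # assumed true slope and panel size
effect = [random.gauss(0, 1) for _ in range(N)]   # unobserved c_i

def draw_period():
    """One time period: x is correlated with the individual effect."""
    x = [c + random.gauss(0, 1) for c in effect]
    y = [BETA * xi + c + random.gauss(0, 1) for xi, c in zip(x, effect)]
    return x, y

x1, y1 = draw_period()
x2, y2 = draw_period()

# Pooled regression ignores c_i, so the slope picks up its correlation with x.
beta_pooled = ols_slope(x1 + x2, y1 + y2)

# First differences: c_i cancels because it is additive and time invariant.
dx = [b - a for a, b in zip(x1, x2)]
dy = [b - a for a, b in zip(y1, y2)]
beta_fd = ols_slope(dx, dy)

print(f"pooled OLS slope:       {beta_pooled:.2f}")
print(f"first-difference slope: {beta_fd:.2f}")
```

Differencing works here only because the effect enters additively in a linear model, which is the qualification made in the text for nonlinear models.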

A second approach to modeling unobserved heterogeneity is through a random ef- 
fects model. There are a number of ways in which the random effects model can be 
formulated. One popular formulation assumes that one or more regression parameters, 
often just the regression intercept, varies randomly across the cross section. In another 
formulation the regression error is given a component structure, with an individual 
specific random component. The random effects model then attempts to estimate the 
parameters of the distribution from which the random component is drawn. In some 
cases, such as demand analysis, the random term can be interpreted as random prefer- 
ence variation. Random effects models can be estimated using either cross-section or 
panel data. 
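
The error-components formulation can be written out as follows. This is a sketch for the linear panel case; the symbols $c_i$, $\varepsilon_{it}$, $\sigma_c^2$, and $\sigma_\varepsilon^2$ are notation assumed here for illustration rather than notation taken from the text.

```latex
y_{it} = \mathbf{x}_{it}'\boldsymbol{\beta} + u_{it},
\qquad
u_{it} = c_i + \varepsilon_{it},

% with c_i and \varepsilon_{it} mutually independent, mean zero,
% \varepsilon_{it} uncorrelated over t, and variances
% \sigma_c^2 and \sigma_\varepsilon^2. Then, for s \neq t,
\operatorname{Cor}(u_{it}, u_{is})
  = \frac{\sigma_c^2}{\sigma_c^2 + \sigma_\varepsilon^2} .
```

Under these assumptions the individual-specific component induces the same correlation between any two errors for the same individual, and the variances $\sigma_c^2$ and $\sigma_\varepsilon^2$ are the distributional parameters to be estimated.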


1.2.6. Dynamics 


A very common assumption in cross-section analysis is the absence of intertempo- 
ral dependence, that is, an absence of dynamics. Thus, implicitly it is assumed that 
the observations correspond to a stochastic equilibrium, with the deviation from the 
equilibrium being represented by serially independent random disturbances. Even in 
microeconometrics for some data situations such an assumption may be too strong. 
For example, it is inconsistent with the presence of serially correlated unobserved het- 
erogeneity. Dependence on lagged dependent variables also violates this assumption. 

The foregoing discussion illustrates some of the potential limitations of a single 
cross-section analysis. Some limitations may be overcome if repeated cross sections 
are available. However, if there is dynamic dependence, the least problematic approach 
might well be to use panel data. 


1.3. Book Outline 


The book is split into six parts. Part 1 presents the issues involved in microeconometric 
modeling. Parts 2 and 3 present general theory for estimation and statistical inference 
for nonlinear regression models. Parts 4 and 5 specialize to the core models used in 
applied microeconometrics for, respectively, cross-section and panel data. Part 6 covers 
broader topics that make considerable use of material presented in the earlier chapters. 

The book outline is summarized in Table 1.1. The remainder of this section details 
each part in turn. 


1.3.1. Part 1: Preliminaries 


Chapters 2 and 3 expand on the special features of the microeconometric approach 
to modeling and microeconomic data structures within the more general statistical 
arena of regression analysis. Many of the issues raised in these chapters are pursued 
throughout the book as the reader develops the necessary tools. 


Table 1.1. Book Outline 

Part and Chapter                                Background^a  Example

1. Preliminaries
   1. Overview
   2. Causal and Noncausal Models                             Simultaneous equations models
   3. Microeconomic Data Structures                           Observational data
2. Core Methods
   4. Linear Models                                           Ordinary least squares
   5. Maximum Likelihood and Nonlinear                        m-estimation or extremum estimation
      Least-Squares Estimation
   6. Generalized Method of Moments and                       Instrumental variables
      Systems Estimation
   7. Hypothesis Tests                                        Wald, score, and likelihood ratio tests
   8. Specification Tests and Model Selection   5,7           Conditional moment test
   9. Semiparametric Methods                                  Kernel regression
  10. Numerical Optimization                                  Newton-Raphson iterative method
3. Simulation-Based Methods
  11. Bootstrap Methods                                       Percentile t-method
  12. Simulation-Based Methods                                Maximum simulated likelihood
  13. Bayesian Methods                                        Markov chain Monte Carlo
4. Models for Cross-Section Data
  14. Binary Outcome Models                                   Logit, probit for y = (0, 1)
  15. Multinomial Models                                      Multinomial logit for y = (1, ..., m)
  16. Tobit and Selection Models                              Tobit for y = max(y*, 0)
  17. Transition Data: Survival Analysis                      Cox proportional hazards for y = min(y*, c)
  18. Mixture Models and Unobserved                           Unobserved heterogeneity
      Heterogeneity
  19. Models for Multiple Hazards                             Multiple hazards
  20. Models of Count Data                                    Poisson for y = 0, 1, 2, ...
5. Models for Panel Data
  21. Linear Panel Models: Basics                             Fixed and random effects
  22. Linear Panel Models: Extensions           6,21          Dynamic and endogenous regressors
  23. Nonlinear Panel Models                    5,6,21,22     Panel logit, Tobit, and Poisson
6. Further Topics
  24. Stratified and Clustered Samples          5,21          Data (y_ij, x_ij) correlated over j
  25. Treatment Evaluation                                    Regressor d = 1 if in program
  26. Measurement Error Models                                Logit model with measurement errors
  27. Missing Data and Imputation                             Regression with missing observations

^a The background gives the essential chapter needed in addition to the treatment of ordinary and weighted LS in 
Chapter 4. Note that the first panel data chapter (Chapter 21) requires only Chapter 4. 


1.3.2. Part 2: Core Methods 


Chapters 4-10 detail the main general methods used in classical estimation and 
statistical inference. The results given in Chapter 5 in particular are extensively used 
throughout the book. 

Chapter 4 presents some results for the linear regression model, emphasizing those 
issues and methods that are most relevant for the rest of the book. Analysis is relatively 
straightforward as there is an explicit expression for linear model estimators such as 
ordinary least squares. 

Chapters 5 and 6 present estimation theory that can be applied to nonlinear models 
for which there is usually no explicit solution for the estimator. Asymptotic theory 
is used to obtain the distribution of estimators, with emphasis on obtaining robust 
standard error estimates that rely on relatively weak distributional assumptions. A quite 
general treatment of estimation, along with specialization to nonlinear least-squares 
and maximum likelihood estimation, is presented in Chapter 5. The more challenging 
generalized method of moments estimator and specialization to instrumental variables 
estimation are given separate treatment in Chapter 6. 

Chapter 7 presents classical hypothesis testing when estimators are nonlinear and 
the hypothesis being tested is possibly nonlinear in parameters. Specification tests in 
addition to hypothesis tests are the subject of Chapter 8. 

Chapter 9 presents semiparametric estimation methods such as kernel regression. 
The leading example is flexible modeling of the conditional mean. For the patents 
example, the nonparametric regression model is E[y|x] = g(x), where the function g(·) 
is unspecified and is instead estimated. Then estimation has an infinite-dimensional 
component g(·), leading to a nonstandard asymptotic theory. With additional 
regressors some further structure is needed and the methods are called semiparametric or 
seminonparametric. 
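
As an illustration of estimating an unspecified g(·), the following is a minimal sketch of a kernel (Nadaraya-Watson) regression estimate at a single point. The simulated data, the Gaussian kernel, the bandwidth of 0.2, and the function names are all assumptions made for this example; they are not taken from the patents example or from Chapter 9.

```python
import math
import random


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel weight function."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kernel_regression(x0, xs, ys, bandwidth):
    """Nadaraya-Watson estimate of g(x0) = E[y | x = x0]."""
    weights = [gaussian_kernel((x - x0) / bandwidth) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)


# Simulated data (assumption): true g(x) = sin(x), noise standard deviation 0.2.
rng = random.Random(0)
xs = [rng.uniform(0.0, 3.0) for _ in range(2000)]
ys = [math.sin(x) + rng.gauss(0.0, 0.2) for x in xs]

g_hat = kernel_regression(1.5, xs, ys, bandwidth=0.2)
print(f"estimate of g(1.5): {g_hat:.3f}, true value: {math.sin(1.5):.3f}")
```

The estimate is a locally weighted average of y, with weights that decline as x moves away from the evaluation point; a smaller bandwidth reduces smoothing bias but increases sampling variability.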

Chapter 10 presents the computational methods used to compute a parameter esti- 
mate when the estimator is defined implicitly, usually as the solution to some first-order 
conditions. 
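
A minimal sketch of such an iterative computation is given below, using the Newton-Raphson method to solve a single first-order condition. The example is an assumption chosen for brevity: a one-parameter Poisson regression with conditional mean exp(b x) and five made-up observations.

```python
import math

# Hypothetical data for a one-parameter Poisson regression with
# conditional mean exp(b * x); the numbers are for illustration only.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [1, 1, 2, 4, 6]


def score(b):
    """First derivative of the Poisson log-likelihood with respect to b."""
    return sum(x * (y - math.exp(b * x)) for x, y in zip(xs, ys))


def hessian(b):
    """Second derivative of the Poisson log-likelihood (negative for these data)."""
    return -sum(x * x * math.exp(b * x) for x in xs)


def newton_raphson(b_start=0.0, tol=1e-12, max_iter=100):
    """Iterate b <- b - score(b) / hessian(b) until the step is negligible."""
    b = b_start
    for _ in range(max_iter):
        step = score(b) / hessian(b)
        b -= step
        if abs(step) < tol:
            return b
    raise RuntimeError("Newton-Raphson did not converge")


b_hat = newton_raphson()
print(f"b_hat = {b_hat:.6f}, score at b_hat = {score(b_hat):.2e}")
```

The estimator has no explicit formula here, but it is defined implicitly by score(b) = 0, and the iterations stop once that condition holds to numerical precision.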


1.3.3. Part 3: Simulation-Based Methods 


Chapters 11-13 consider methods of estimation and inference that rely on simulation. 
These methods are generally more computationally intensive and, currently, less 
utilized than the methods presented in Part 2. 

Chapter 11 presents the bootstrap method for statistical inference. This yields the 
empirical distribution of an estimator by obtaining new samples by simulation, such 
as by repeated resampling with replacement from the original sample. The bootstrap 
can provide a simple way to obtain standard errors when the formulas from asymptotic 
theory are complex, as is the case for some two-step estimators. Furthermore, if 
implemented appropriately, the bootstrap can lead to better statistical inference in 
small samples. 
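
The resampling idea can be sketched as follows. This is an illustrative example under stated assumptions: the data are simulated, the helper name bootstrap_se is hypothetical, and the sample mean is used only because its bootstrap standard error can be compared with the usual formula.

```python
import random
import statistics


def bootstrap_se(sample, estimator, n_boot=1000, seed=0):
    """Standard error of `estimator` from resampling `sample` with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = [estimator(rng.choices(sample, k=n)) for _ in range(n_boot)]
    return statistics.stdev(estimates)


# Simulated sample (assumption): 200 standard normal draws.
data_rng = random.Random(1)
data = [data_rng.gauss(0.0, 1.0) for _ in range(200)]

se_boot = bootstrap_se(data, statistics.mean)
se_formula = statistics.stdev(data) / len(data) ** 0.5
print(f"bootstrap se: {se_boot:.4f}, formula se: {se_formula:.4f}")

# The same function applies where no simple formula is at hand, e.g. the median.
se_median = bootstrap_se(data, statistics.median)
print(f"bootstrap se of the median: {se_median:.4f}")
```

Each bootstrap replication recomputes the estimator on a resample of the original observations, and the spread of those replications approximates the sampling variability of the estimator.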

Chapter 12 presents simulation-based estimation methods, developed for models 
that involve an integral over a probability distribution for which there is no closed- 
form solution. Estimation is still possible by making multiple draws from the relevant 
distribution and averaging. 
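
A minimal sketch of the "draw and average" idea follows. The setting is an assumption made for illustration: a logit probability with a normally distributed random intercept, whose expectation over the intercept has no closed form. A fine grid sum is included only as a check on the simulated value.

```python
import math
import random


def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))


def simulated_probability(index, sigma, n_draws=20000, seed=0):
    """Approximate E_v[logistic(index + sigma * v)], v ~ N(0, 1), by averaging draws."""
    rng = random.Random(seed)
    total = sum(logistic(index + sigma * rng.gauss(0.0, 1.0)) for _ in range(n_draws))
    return total / n_draws


def quadrature_probability(index, sigma, n_grid=4001, width=8.0):
    """Same integral by a fine Riemann sum over [-width, width], as a check."""
    step = 2.0 * width / (n_grid - 1)
    total = 0.0
    for k in range(n_grid):
        v = -width + k * step
        density = math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)
        total += logistic(index + sigma * v) * density * step
    return total


p_sim = simulated_probability(index=0.5, sigma=1.0)
p_quad = quadrature_probability(index=0.5, sigma=1.0)
print(f"simulated: {p_sim:.4f}, quadrature: {p_quad:.4f}")
```

In one dimension a grid is feasible, but averaging over draws extends directly to integrals of higher dimension, which is where simulation-based estimation is most useful.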

Chapter 13 presents Bayesian methods, which combine a distribution for the ob- 
served data with a specified prior distribution for parameters to obtain a posterior dis- 
tribution of the parameters that is the basis for estimation. Recent advances make com- 
putation possible even if there is no closed-form solution for the posterior distribution. 
Bayesian analysis can provide an approach to estimation and inference that is quite dif- 
ferent from the classical approach. However, in many cases only the Bayesian tool kit 
is adopted to permit classical estimation and inference for problems that are otherwise 
intractable. 


1.3.4. Part 4: Models for Cross-Section Data 


Chapters 14-20 present the main nonlinear models for cross-section data. This part is 
the heart of the book and presents advanced topics such as models for limited depen- 
dent variables and sample selection. The classes of models are defined by the range of 
values taken by the dependent variable. 

Binary data models for a dependent variable that can take only two possible values, 
say y = 0 or y = 1, are presented in Chapter 14. In Chapter 15 an extension is made to 
multinomial models, for a dependent variable that takes several discrete values. 
Examples include employment status (employed, unemployed, and out of the labor force) 
and mode of transportation to work (car, bus, or train). Linear models can be 
informative but are not appropriate, as they can lead to predicted probabilities outside the unit 
interval. Instead logit, probit, and related models are used. 
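
The problem with linear models for binary outcomes can be seen in a small simulation. The sketch below is illustrative only: the data-generating process, with Pr[y = 1 | x] equal to the logistic function of 2x, and the helper names are assumptions.

```python
import math
import random


def ols_line(xs, ys):
    """Intercept and slope from least-squares regression of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))


# Simulated binary data (assumption): Pr[y = 1 | x] = logistic(2x).
rng = random.Random(0)
xs = [rng.uniform(-3.0, 3.0) for _ in range(2000)]
ys = [1 if rng.random() < logistic(2.0 * x) else 0 for x in xs]

intercept, slope = ols_line(xs, ys)
lpm_high = intercept + slope * 3.0   # linear prediction at x = 3
lpm_low = intercept + slope * -3.0   # linear prediction at x = -3
print(f"linear predictions: {lpm_low:.3f} and {lpm_high:.3f}")
print(f"logit probabilities: {logistic(-6.0):.3f} and {logistic(6.0):.3f}")
```

The fitted line gives a prediction below zero at x = -3 and above one at x = 3, whereas the logistic function is bounded between zero and one for any value of its argument.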

Chapter 16 presents models with censoring, truncation, and sample selection. 
Examples include annual hours of work, conditional on choosing to work, and hospital 
expenditures, conditional on being hospitalized. In these cases the data are incompletely 
observed with a bunching of observations at y = 0 and with the remaining y > 0. 
The model for the observed data can be shown to be nonlinear even if the underlying 
process is linear, and linear regression on the observed data can be very misleading. 
Simple corrections for censoring, truncation, or sample selection such as the Tobit 
model exist, but these are very dependent on distributional assumptions. 
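
How misleading linear regression on censored data can be is easy to see by simulation. The sketch below assumes a linear latent process y* = x + u with standard normal x and u, observed as y = max(y*, 0); these choices and the helper name are assumptions for illustration.

```python
import random


def ols_slope(xs, ys):
    """Slope from least-squares regression of y on x (with an intercept)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


# Simulated data (assumption): latent y* = x + u, observed y = max(y*, 0).
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(5000)]
y_star = [x + rng.gauss(0.0, 1.0) for x in xs]
y_obs = [max(v, 0.0) for v in y_star]

slope_latent = ols_slope(xs, y_star)     # regression on the uncensored variable
slope_censored = ols_slope(xs, y_obs)    # regression on the censored variable
print(f"latent slope: {slope_latent:.3f}, censored slope: {slope_censored:.3f}")
```

In this design about half the observations are censored at zero, and the slope from regressing the censored outcome on x is roughly half the slope of the underlying linear process.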

Models for duration data are presented in Chapters 17-19. An example is the length 
of an unemployment spell. Standard regression models include the exponential, Weibull, 
and Cox proportional hazards model. Additionally, as in Chapter 16, the dependent 
variable is often incompletely observed. For example, the data may be on the length of 
a current spell that is incomplete, rather than the length of a completed spell. 

Chapter 20 presents count data models. Examples include various measures of 
health utilization such as number of doctor visits and number of days hospitalized. 
Again the model is nonlinear, as counts and hence the conditional mean are nonnega- 
tive. Leading parametric models include the Poisson and negative binomial. 


1.3.5. Part 5: Models for Panel Data 


Chapters 21-23 present methods for panel data. Here the data are observed in several 
time periods for each of the many individuals in the sample, so the dependent variable 
and regressors are indexed by both individual and time. Any analysis needs to control 
for the likely positive correlation of error terms in different time periods for a given in- 
dividual. Additionally, panel data can provide sufficient data to control for unobserved 
time-invariant individual-specific effects, permitting identification of causation under 
weaker assumptions than those needed if only cross-section data are available. 

The basic linear panel data model is presented in Chapter 21, with emphasis on 
fixed effects and random effects models. Extensions of linear models to permit lagged 
dependent variables and endogenous regressors are presented in Chapter 22. Panel 
methods for the nonlinear models of Part 4 are presented in Chapter 23. 
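
The gain from panel data in controlling for time-invariant individual effects can be sketched with a simulated example. Everything here is an assumption for illustration: the model y_it = a_i + b x_it + e_it with b = 1, a regressor that is correlated with the individual effect a_i, and the use of individual demeaning (the within transformation) to remove a_i.

```python
import random


def ols_slope(xs, ys):
    """Slope from least-squares regression of y on x (with an intercept)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(0)
n_individuals, n_periods, beta = 500, 5, 1.0

x_all, y_all, x_within, y_within = [], [], [], []
for _ in range(n_individuals):
    alpha = rng.gauss(0.0, 1.0)  # unobserved individual effect
    x_i = [alpha + rng.gauss(0.0, 1.0) for _ in range(n_periods)]  # correlated with alpha
    y_i = [alpha + beta * x + rng.gauss(0.0, 1.0) for x in x_i]
    x_mean = sum(x_i) / n_periods
    y_mean = sum(y_i) / n_periods
    x_all += x_i
    y_all += y_i
    # Subtracting individual means removes the time-invariant effect alpha.
    x_within += [x - x_mean for x in x_i]
    y_within += [y - y_mean for y in y_i]

pooled = ols_slope(x_all, y_all)
within = ols_slope(x_within, y_within)
print(f"pooled OLS slope: {pooled:.3f}, within slope: {within:.3f}, true: {beta}")
```

Pooled least squares, which ignores the individual effect, is pushed away from the true slope by the correlation between x and the effect, while the within slope is close to the true value of one.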

The panel data methods are placed late in the book to permit a unified self-contained 
treatment. Chapter 21 could have been placed immediately after Chapter 4 and is writ- 
ten in an accessible manner that relies on little more than knowledge of least-squares 
estimation. 


1.3.6. Part 6: Further Topics 


This part considers important topics that can generally relate to any and all models 
covered in Parts 4 and 5. Chapter 24 deals with modeling of clustered data in sev- 
eral different models. Chapter 25 discusses treatment evaluation. Treatment evaluation 
is a general term that can cover a wide variety of models in which the focus is on 
measuring the impact of some “treatment” that is either exogenously or randomly as- 
signed to an individual on some measure of interest, denoted an “outcome variable.” 
Chapter 26 deals with the consequences of measurement errors in outcome and/or 
regressor variables, with emphasis on some leading nonlinear models. Chapter 27 
considers some methods of handling missing data in linear and nonlinear regression 
models. 


1.4. How to Use This Book 


The book assumes a basic understanding of the linear regression model with matrix 
algebra. It is written at the mathematical level of the first-year economics Ph.D. se- 
quence, comparable to Greene (2003). 

Although some of the material in this book is covered in a first-year sequence, 
most of it appears in second-year econometrics Ph.D. courses or in data-oriented mi- 
croeconomics field courses such as labor economics, public economics, or industrial 
organization. This book is intended to be used as both an econometrics text and as an 
adjunct for such field courses. More generally, the book is intended to be useful as a 
reference work for applied researchers in economics, in related social sciences such as 
sociology and political science, and in epidemiology. 

For readers using this book as a reference work, the models chapters have been 
written to be as self-contained as possible. For the specific models presented in Parts 4 
and 5 it will generally be sufficient to read the relevant chapter in isolation, except 
that some command of the general estimation results in Chapter 5 and in some cases 
Chapter 6 will be necessary. Most chapters are structured to begin with a discussion 
and example that is accessible to a wide audience. 


Table 1.2. Outline of a 20-Lecture 10-Week Course 

Lectures  Chapter      Topic
1-3       4, Appx. A   Review of linear models and asymptotic theory
4-7       5            Estimation: m-estimation, ML, and NLS
8         10           Estimation: numerical optimization
9-11      14, 15       Models: binary and multinomial
12-14     16           Models: censored and truncated
15        6            Estimation: GMM
16        7            Testing: hypothesis tests
17-19     21           Models: basic linear panel
20        9            Estimation: semiparametric

For instructors using this book as a course text it is best to introduce the basic non- 
linear cross-section and linear panel data models as early as possible, skipping many 
of the methods chapters. The most commonly used nonlinear cross-section models 
are presented in Chapters 14-16; these require knowledge of maximum likelihood 
and least-squares estimation, presented in Chapter 5. Chapter 21 on linear panel data 
models requires even less preparation, essentially just Chapter 4. 

Table 1.2 provides an outline for a one-quarter second-year graduate course taught 
at the University of California, Davis, immediately following the required first-year 
statistics and econometrics sequence. A quarter provides sufficient time to cover the 
basic results given in the first half of the chapters in this outline. With additional time 
one can go into further detail or cover a subset of Chapters 11-13 on computation- 
ally intensive estimation methods (simulation-based estimation, the bootstrap, which 
is also briefly presented in Chapter 7, and Bayesian methods); additional cross-section 
models (durations and counts) presented in Chapters 17-20; and additional panel data 
models (linear model extensions and nonlinear models) given in Chapters 22 and 23. 

At Indiana University, Bloomington, a 15-week semester-long field course in mi- 
croeconometrics is based on material in most of Parts 4 and 5. The prerequisite courses 
for this course cover material similar to that in Part 2. 

Some exercises are provided at the end of each chapter after the first three intro- 
ductory chapters. These exercises are usually learning-by-doing exercises; some are 
purely methodological whereas others entail analysis of generated or actual data. The 
level of difficulty of the questions is mostly related to the level of difficulty of the topic. 


1.5. Software 


There are many software packages available for data analysis. Popular packages with 
strong microeconometric capabilities include LIMDEP, SAS, and STATA, all of which 
offer an impressive range of canned routines and additionally support user-defined 
procedures using a matrix programming language. Other packages that are also widely 
used include EVIEWS, PCGIVE, and TSP. Despite their time-series orientation, these 
can support some cross-section data analysis. Users who wish to do their own pro- 
gramming also have available a variety of options including GAUSS, MATLAB, OX, 
and SAS/IML. The latest detailed information about these packages and many others 
can be efficiently located via an Internet browser and a search engine. 


1.6. Notation and Conventions 


Vector and matrix algebra are used extensively. 

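Vectors are defined as column vectors and represented using lowercase bold. For 
example, for linear regression the regressor vector $\mathbf{x}$ is a $K \times 1$ column vector with $j$th 
entry $x_j$ and the parameter vector $\boldsymbol{\beta}$ is a $K \times 1$ column vector with $j$th entry $\beta_j$, so 

\[
\underset{(K \times 1)}{\mathbf{x}} = \begin{bmatrix} x_1 \\ \vdots \\ x_K \end{bmatrix}
\quad \text{and} \quad
\underset{(K \times 1)}{\boldsymbol{\beta}} = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_K \end{bmatrix}.
\]

Then the linear regression model $y = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_K x_K + u$ is expressed as 
$y = \mathbf{x}'\boldsymbol{\beta} + u$. At times a subscript $i$ is added to denote the typical $i$th observation. The 
linear regression equation for the $i$th observation is then 

\[
y_i = \mathbf{x}_i'\boldsymbol{\beta} + u_i.
\]

The sample is one of $N$ observations, $\{(y_i, \mathbf{x}_i),\ i = 1, \ldots, N\}$. In this book 
observations are usually assumed to be independent over $i$. 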

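Matrices are represented using uppercase bold. In matrix notation the sample is 
$(\mathbf{y}, \mathbf{X})$, where $\mathbf{y}$ is an $N \times 1$ vector with $i$th entry $y_i$ and $\mathbf{X}$ is a matrix with $i$th row $\mathbf{x}_i'$, 
so 

\[
\underset{(N \times 1)}{\mathbf{y}} = \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}
\quad \text{and} \quad
\underset{(N \times \dim(\mathbf{x}))}{\mathbf{X}} = \begin{bmatrix} \mathbf{x}_1' \\ \vdots \\ \mathbf{x}_N' \end{bmatrix}.
\]

The linear regression model upon stacking all $N$ observations is then 

\[
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u},
\]

where $\mathbf{u}$ is an $N \times 1$ column vector with $i$th entry $u_i$. 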

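Matrix notation is compact but at times it is clearer to write products of matrices 
as summations of products of vectors. For example, the OLS estimator can be 
equivalently written in either of the following ways: 

\[
\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}
= \left( \sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i' \right)^{-1} \sum_{i=1}^{N} \mathbf{x}_i y_i.
\]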
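
The equivalence of the matrix form and the summation form of the OLS estimator can be checked numerically. The sketch below uses a made-up data set with an intercept and one regressor, constructed so that y_i = 1 + 2 x_i holds exactly; the helper functions are hypothetical and written with plain lists so that each matrix operation is visible.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


# Hypothetical data: rows of X are x_i' = (1, x_i2), and y_i = 1 + 2 * x_i2 exactly.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [3.0, 5.0, 7.0, 9.0]

# Matrix form: (X'X)^{-1} X'y.
Xt = transpose(X)
XtX = matmul(Xt, X)
Xty = matmul(Xt, [[v] for v in y])
beta_matrix = [row[0] for row in matmul(inverse_2x2(XtX), Xty)]

# Summation form: (sum_i x_i x_i')^{-1} sum_i x_i y_i.
sum_xx = [[0.0, 0.0], [0.0, 0.0]]
sum_xy = [0.0, 0.0]
for x_i, y_i in zip(X, y):
    for j in range(2):
        sum_xy[j] += x_i[j] * y_i
        for k in range(2):
            sum_xx[j][k] += x_i[j] * x_i[k]
inv = inverse_2x2(sum_xx)
beta_sum = [sum(inv[j][k] * sum_xy[k] for k in range(2)) for j in range(2)]

print(beta_matrix, beta_sum)
```

Both routes accumulate the same K x K and K x 1 quantities, so they return the same coefficients, here an intercept of 1 and a slope of 2 up to rounding.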


Table 1.3. Commonly Used Acronyms and Abbreviations 

Linear      OLS      Ordinary least squares
            GLS      Generalized least squares
            FGLS     Feasible generalized least squares
            IV       Instrumental variables
            2SLS     Two-stage least squares
            3SLS     Three-stage least squares

Nonlinear   NLS      Nonlinear least squares
            FGNLS    Feasible generalized nonlinear least squares
            NIV      Nonlinear instrumental variables
            NL2SLS   Nonlinear two-stage least squares
            NL3SLS   Nonlinear three-stage least squares

General     LS       Least squares
            ML       Maximum likelihood
            QML      Quasi-maximum likelihood
            GMM      Generalized method of moments
            GEE      Generalized estimating equations


Generic notation for a parameter is the $q \times 1$ vector $\boldsymbol{\theta}$. The regression parameters 
are represented by the $K \times 1$ vector $\boldsymbol{\beta}$, which may equal $\boldsymbol{\theta}$ or may be a subset of $\boldsymbol{\theta}$ 
depending on the context. 


The book uses many abbreviations and acronyms. Table 1.3 summarizes abbrevia- 
tions used for some common estimation methods, ordered by whether the estimator is 
developed for linear or nonlinear regression models. We also use the following: dgp 
(data-generating process), iid (independently and identically distributed), pdf 
(probability density function), cdf (cumulative distribution function), L (likelihood), ln L 
(log-likelihood), FE (fixed effects), and RE (random effects). 


CHAPTER 2 


Causal and Noncausal Models 


2.1. Introduction 


Microeconometrics deals with the theory and applications of methods of data analysis 
developed for microdata pertaining to individuals, households, and firms. A broader 
definition might also include regional- and state-level data. Microdata are usually 
either cross sectional, in which case they refer to conditions at the same point in 
time, or longitudinal (panel) in which case they refer to the same observational units 
over several periods. Such observations are generated from both nonexperimental 
setups, such as censuses and surveys, and quasi-experimental or experimental setups, 
such as social experiments implemented by governments with the participation of 
volunteers. 

A microeconometric model may be a full specification of the probability distribu- 
tion of a set of microeconomic observations; it may also be a partial specification of 
some distributional properties, such as moments, of a subset of variables. The mean of 
a single dependent variable conditional on regressors is of particular interest. 

There are several objectives of microeconometrics. They include both data descrip- 
tion and causal inference. The first can be defined broadly to include moment prop- 
erties of response variables, or regression equations that highlight associations rather 
than causal relations. The second category includes causal relationships that aim at 
measurement and/or empirical confirmation or refutation of conjectures and proposi- 
tions regarding microeconomic behavior. The type and style of empirical investigations 
therefore span a wide spectrum. At one end of the spectrum can be found very highly 
structured models, derived from detailed specification of the underlying economic be- 
havior, that analyze causal (behavioral) or structural relationships for interdependent 
microeconomic variables. At the other end are reduced form studies that aim to un- 
cover correlations and associations among variables, without necessarily relying on 
a detailed specification of all relevant interdependencies. Both approaches share the 
common goal of uncovering important and striking relationships that could be helpful 
in understanding microeconomic behavior, but they differ in the extent to which they 
rely on economic theory to guide their empirical investigations. 

As a subdiscipline microeconometrics is newer than macroeconometrics, which is 
concerned with modeling of market and aggregate data. A great deal of the early 
work in applied econometrics was based on aggregate time-series data collected by 
government agencies. Much of the early work on statistical demand analysis up until 
about 1940 used market rather than individual or household data (Hendry and Morgan, 
1996). Morgan’s (1990) book on the history of econometric ideas makes no reference 
to microeconometric work before the 1940s, with one important exception. That ex- 
ception is the work on household budget data that was instigated by concern with the 
living standards of the less well-off in many countries. This led to the collection of 
household budget data that provided the raw material for some of the earlier microe- 
conometric studies such as those pioneered by Allen and Bowley (1935). Nevertheless, 
it is only since the 1950s that microeconometrics has emerged as a distinctive and rec- 
ognized subdiscipline. Even into the 1960s the core of microeconometrics consisted 
of demand analyses based on household surveys. 

With the award of the year 2000 Nobel Prize in Economics to James Heckman 
and Daniel McFadden for their contributions to microeconometrics, the subject area 
has achieved clear recognition as a distinct subdiscipline. The award cited Heckman 
“for his development of theory and methods for analyzing selective samples” and 
McFadden “for his development of theory and methods for analyzing discrete choice.” 
Examples of the type of topics that microeconometrics deals with were also men- 
tioned in the citation: “... what factors determine whether an individual decides to 
work and, if so, how many hours? How do economic incentives affect individual 
choices regarding education, occupation or place of residence? What are the effects 
of different labor-market and educational programs on an individual’s income and 
employment?” 

Applications of microeconometric methods can be found not only in every area of 
microeconomics but also in other cognate social sciences such as political science, 
sociology, and geography. 

Beginning with the 1970s and especially within the past two decades revolution- 
ary advances in our capacity for handling large data sets and associated computations 
have taken place. These, together with the accompanying explosion in the availability 
of large microeconomic data sets, have greatly expanded the scope of microecono- 
metrics. As a result, although empirical demand analysis continues to be one of the 
most important areas of application for microeconometric methods, its style and con- 
tent have been heavily influenced by newer methods and models. Further, applications 
in economic development, finance, health, industrial organization, labor and public 
economics, and applied microeconomics generally are now commonplace, and these 
applications will be encountered at various places in this book. 

The primary focus of this book is on the newer material that has emerged in the 
past three decades. Our goal is to survey concepts, models, and methods that we re- 
gard as standard components of a modern microeconometrician’s tool kit. Of course, 
the notion of standard methods and models is inevitably both subjective and elastic, 
being a function of the presumed clientele of this book as well as the authors’ own 
backgrounds. There may also be topics we regard as too advanced for an introductory 
book such as this that others would place in a different category. 

Microeconometrics focuses on the complications of nonlinear models and on 
obtaining estimates that can be given a structural interpretation. Much of this book, 
especially Parts 2-4, presents methods for nonlinear models. These nonlinear methods 
overlap with many areas of applied statistics including biostatistics. By contrast, the 
distinguishing feature of econometrics is the emphasis placed on causal modeling. 
This chapter introduces the key concepts related to causal (and noncausal) modeling, 
concepts that are germane to both linear and nonlinear models. 

Sections 2.2 and 2.3 introduce the key concepts of structure and exogeneity. 
Section 2.4 uses the linear simultaneous equations model as a specific illustration 
of a structural model and connects it with the other important concepts of reduced 
form models. Identification definitions are given in Section 2.5. Section 2.6 considers 
single-equation structural models. Section 2.7 introduces the potential outcome model 
and compares the causal parameters and interpretations in the potential outcome model 
with those in the simultaneous equations model. Section 2.8 provides a brief discus- 
sion of modeling and estimation strategies designed to handle computational and data 
challenges. 


2.2. Structural Models 


Structure consists of 


1. a set of variables W (“data”) partitioned for convenience as [Y Z]; 
2. a joint probability distribution of W, F(W); 


3. an a priori ordering of W according to hypothetical cause-and-effect relationships and 
specification of a priori restrictions on the hypothesized model; and 


4. a parametric, semiparametric, or nonparametric specification of functional forms and 
the restrictions on the parameters of the model. 


This general description of a structural model is consistent with a well-established 
Cowles Commission definition of a structure. For example, Sargan (1988, p. 27) states: 


A model is the specification of the probability distribution for a set of observations. 
A structure is the specification of the parameters of that distribution. Therefore, a 
structure is a model in which all the parameters are assigned numerical values. 


We consider the case in which the modeling objective is to explain the values of 
an observable vector-valued variable $\mathbf{y}$, $\mathbf{y}' = (y_1, \ldots, y_G)$. Each element of $\mathbf{y}$ is a 
function of some other elements of $\mathbf{y}$ and of explanatory variables $\mathbf{z}$ and a purely random 
disturbance $\mathbf{u}$. Note that the variables $\mathbf{y}$ are assumed to be interdependent. By contrast, 
interdependence among the elements of $\mathbf{z}_i$ is not modeled. The $i$th observation satisfies the set of 
implicit equations 

\[
\mathbf{g}(\mathbf{y}_i, \mathbf{z}_i, \mathbf{u}_i \mid \boldsymbol{\theta}) = \mathbf{0}, \qquad (2.1)
\]

where $\mathbf{g}$ is a known function. We refer to this as the structural model, and we refer to 
$\boldsymbol{\theta}$ as structural parameters. This corresponds to property 4 given earlier in this section. 

Assume that there is a unique solution for $\mathbf{y}_i$ for every $(\mathbf{z}_i, \mathbf{u}_i)$. Then we can write 
the equations in an explicit form with $\mathbf{y}$ as a function of $(\mathbf{z}, \mathbf{u})$: 

\[
\mathbf{y}_i = \mathbf{f}(\mathbf{z}_i, \mathbf{u}_i \mid \boldsymbol{\pi}). \qquad (2.2)
\]

This is referred to as the reduced form of the structural model, where $\boldsymbol{\pi}$ is a vector 
of reduced form parameters that are functions of $\boldsymbol{\theta}$. The reduced form is obtained 
by solving the structural model for the endogenous variables $\mathbf{y}_i$, given $(\mathbf{z}_i, \mathbf{u}_i)$. 
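
To make the step from structural model to reduced form concrete, the following sketch solves a hypothetical two-equation linear market model for its endogenous variables. The demand and supply equations, the parameter values, and the function names are assumptions chosen for illustration; the text's general model is not restricted to this form.

```python
# Hypothetical structural parameters theta for a two-equation market model:
#   demand: q = a0 + a1 * p + a2 * z + u1
#   supply: q = b0 + b1 * p + u2
theta = {"a0": 10.0, "a1": -1.0, "a2": 0.5, "b0": 2.0, "b1": 1.0}


def reduced_form_parameters(t):
    """Reduced form parameters pi, each a function of the structural theta."""
    denom = t["b1"] - t["a1"]
    pi_p = {"const": (t["a0"] - t["b0"]) / denom, "z": t["a2"] / denom}
    pi_q = {"const": t["b0"] + t["b1"] * pi_p["const"], "z": t["b1"] * pi_p["z"]}
    return pi_p, pi_q, denom


def solve_reduced_form(t, z, u1, u2):
    """Explicit solution for the endogenous pair (q, p) given (z, u1, u2)."""
    pi_p, pi_q, denom = reduced_form_parameters(t)
    p = pi_p["const"] + pi_p["z"] * z + (u1 - u2) / denom
    q = pi_q["const"] + pi_q["z"] * z + (t["b1"] * u1 - t["a1"] * u2) / denom
    return q, p


q, p = solve_reduced_form(theta, z=4.0, u1=0.3, u2=-0.2)

# The solution should satisfy both structural (implicit) equations.
demand_resid = q - (theta["a0"] + theta["a1"] * p + theta["a2"] * 4.0 + 0.3)
supply_resid = q - (theta["b0"] + theta["b1"] * p - 0.2)
print(f"q = {q:.3f}, p = {p:.3f}, residuals: {demand_resid:.1e}, {supply_resid:.1e}")
```

Each reduced form coefficient is a ratio or product of structural coefficients, which is why inference on the structural parameters from reduced form estimates is indirect.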

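If the objective of modeling is inference about elements of $\boldsymbol{\theta}$, then (2.1) provides a 
direct route. This involves estimation of the structural model. However, because 
elements of $\boldsymbol{\pi}$ are functions of $\boldsymbol{\theta}$, (2.2) also provides an indirect route to inference on $\boldsymbol{\theta}$. 
If $\mathbf{f}(\mathbf{z}_i, \mathbf{u}_i \mid \boldsymbol{\pi})$ has a known functional form, and if it is additively separable in $\mathbf{z}_i$ and $\mathbf{u}_i$, 
such that we can write 

\[
\mathbf{y}_i = \mathbf{g}(\mathbf{z}_i \mid \boldsymbol{\pi}) + \mathbf{u}_i = \mathrm{E}[\mathbf{y}_i \mid \mathbf{z}_i] + \mathbf{u}_i, \qquad (2.3)
\]

then the regression of $\mathbf{y}$ on $\mathbf{z}$ is a natural prediction function for $\mathbf{y}$ given $\mathbf{z}$. In this 
sense the reduced form equation has a useful role for making conditional predictions 
of $\mathbf{y}_i$ given $(\mathbf{z}_i, \mathbf{u}_i)$. To generate predictions of the left-hand-side variable for assigned 
values of the right-hand-side variables of (2.2) requires estimates of $\boldsymbol{\pi}$, which may be 
computationally simpler to obtain than estimates of $\boldsymbol{\theta}$. 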

An important extension of (2.3) is the transformation model, which for scalar $y$ 
takes the form 

\[
\Lambda(y) = \mathbf{z}'\boldsymbol{\pi} + u, \qquad (2.4)
\]

where $\Lambda(y)$ is a transformation function (e.g., $\Lambda(y) = \ln(y)$ or $\Lambda(y) = y^{1/2}$). In some 
cases the transformation function may depend on unknown parameters. A 
transformation model is distinct from a regression, but it too can be used to make estimates 
of $\mathrm{E}[y \mid \mathbf{z}]$. An important example is the accelerated failure time model analyzed in 
Chapter 17. 
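
As a worked illustration of how a transformation model yields the conditional mean, consider the log transformation. Inverting the transformation gives the first line below for any error distribution. The second line adds an assumption not made in the text, namely that $u$ is normally distributed with mean zero and variance $\sigma^2$ conditional on $\mathbf{z}$.

```latex
\ln y = \mathbf{z}'\boldsymbol{\pi} + u
\;\Longrightarrow\;
y = \exp(\mathbf{z}'\boldsymbol{\pi})\exp(u),
\qquad
\mathrm{E}[y \mid \mathbf{z}] = \exp(\mathbf{z}'\boldsymbol{\pi})\,\mathrm{E}[\exp(u) \mid \mathbf{z}];

u \mid \mathbf{z} \sim \mathcal{N}(0, \sigma^2)
\;\Longrightarrow\;
\mathrm{E}[y \mid \mathbf{z}] = \exp\!\left(\mathbf{z}'\boldsymbol{\pi} + \tfrac{1}{2}\sigma^2\right).
```

The conditional mean of $y$ therefore depends on the distribution of $u$ and is not simply $\exp(\mathbf{z}'\boldsymbol{\pi})$, which is one sense in which a transformation model differs from a regression.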

One of the most important, and potentially controversial, steps in the specification 
of the structural model is property 3, in which an a priori ordering of variables into 
causes and effects is assigned. In essence this involves drawing a distinction between 
those variables whose variation the model is designed to explain and those whose 
variation is externally determined and hence lie outside the scope of investigation. In 
microeconometrics, examples of the former are years of schooling and hours worked; 
examples of the latter are gender, ethnicity, age, and similar demographic variables. 
The former, denoted y, are referred to as endogenous and the latter, denoted z, are 
called exogenous variables. 

Exogeneity of a variable is an important simplification because in essence it jus- 
tifies the decision to treat that variable as ancillary, and not to model that variable 
because the parameters of that relationship have no direct bearing on the variable 
under study. This important notion needs a more formal definition, which we now 
provide. 


2.3. Exogeneity 


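We begin by considering the representation of a general finite dimensional parametric 
case in which the joint distribution of $\mathbf{W}$, with parameters $\boldsymbol{\theta}$ partitioned as $(\boldsymbol{\theta}_1\ \boldsymbol{\theta}_2)$, is 
factored into the conditional density of $\mathbf{Y}$ given $\mathbf{Z}$, and the marginal distribution of $\mathbf{Z}$, 
giving 

\[
f_J(\mathbf{W} \mid \boldsymbol{\theta}) = f_C(\mathbf{Y} \mid \mathbf{Z}, \boldsymbol{\theta}) \times f_M(\mathbf{Z} \mid \boldsymbol{\theta}). \qquad (2.5)
\]

A special case of this result occurs if 

\[
f_J(\mathbf{W} \mid \boldsymbol{\theta}) = f_C(\mathbf{Y} \mid \mathbf{Z}, \boldsymbol{\theta}_1) \times f_M(\mathbf{Z} \mid \boldsymbol{\theta}_2),
\]

where $\boldsymbol{\theta}_1$ and $\boldsymbol{\theta}_2$ are functionally independent. Then we say that $\mathbf{Z}$ is exogenous with 
respect to $\boldsymbol{\theta}_1$; this means that knowledge of $f_M(\mathbf{Z} \mid \boldsymbol{\theta}_2)$ is not required for inference on 
$\boldsymbol{\theta}_1$, and hence we can validly condition the distribution of $\mathbf{Y}$ on $\mathbf{Z}$. 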
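
One way to see why the marginal distribution can be ignored in this special case is to take logarithms of the factorization. The display below is a short sketch that uses the subscripts J, C, and M for the joint, conditional, and marginal densities and assumes the densities are differentiable in the parameters.

```latex
\ln f_J(\mathbf{W} \mid \boldsymbol{\theta})
  = \ln f_C(\mathbf{Y} \mid \mathbf{Z}, \boldsymbol{\theta}_1)
  + \ln f_M(\mathbf{Z} \mid \boldsymbol{\theta}_2),
\qquad
\frac{\partial \ln f_J(\mathbf{W} \mid \boldsymbol{\theta})}{\partial \boldsymbol{\theta}_1}
  = \frac{\partial \ln f_C(\mathbf{Y} \mid \mathbf{Z}, \boldsymbol{\theta}_1)}{\partial \boldsymbol{\theta}_1}.
```

Because the second term does not involve the first block of parameters, and the two blocks are functionally independent, maximizing the joint log-likelihood over the first block gives the same result as maximizing the conditional log-likelihood alone.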

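Models can always be reparameterized. So next consider the case in which the 
model is reparameterized in terms of parameters $\boldsymbol{\varphi}$, with one-to-one transformation 
of $\boldsymbol{\theta}$, say $\boldsymbol{\varphi} = h(\boldsymbol{\theta})$, where $\boldsymbol{\varphi}$ is partitioned into $(\boldsymbol{\varphi}_1, \boldsymbol{\varphi}_2)$. This reparametrization may 
be of interest if, for example, $\boldsymbol{\varphi}_1$ is structurally invariant to a class of policy 
interventions. Suppose $\boldsymbol{\varphi}_1$ is the parameter of interest. In such a case one is interested in the 
exogeneity of $\mathbf{Z}$ with respect to $\boldsymbol{\varphi}_1$. Then, the condition for exogeneity is that 

\[
f_J(\mathbf{W} \mid \boldsymbol{\varphi}) = f_C(\mathbf{Y} \mid \mathbf{Z}, \boldsymbol{\varphi}_1) \times f_M(\mathbf{Z} \mid \boldsymbol{\varphi}_2), \qquad (2.6)
\]

where $\boldsymbol{\varphi}_1$ is independent of $\boldsymbol{\varphi}_2$. 

Finally consider the case in which the interest is in a parameter $\boldsymbol{\lambda}$ that is a function 
of $\boldsymbol{\varphi}$, say $h(\boldsymbol{\varphi})$. Then for exogeneity of $\mathbf{Z}$ with respect to $\boldsymbol{\lambda}$, we need two conditions: 
(i) $\boldsymbol{\lambda}$ depends only on $\boldsymbol{\varphi}_1$, i.e., $\boldsymbol{\lambda} = h(\boldsymbol{\varphi}_1)$, and hence only the conditional distribution is 
of interest; and (ii) $\boldsymbol{\varphi}_1$ and $\boldsymbol{\varphi}_2$ are "variation free," which means that the parameters of 
the joint distribution are not subject to cross-restrictions, i.e., $(\boldsymbol{\varphi}_1, \boldsymbol{\varphi}_2) \in \Phi_1 \times \Phi_2 = 
\{\boldsymbol{\varphi}_1 \in \Phi_1, \boldsymbol{\varphi}_2 \in \Phi_2\}$. 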

The factorization in (2.5)-(2.6) plays an important role in the development of the 
exogeneity concept. Of special interest in this book are the following three con- 
cepts related to exogeneity: (1) weak exogeneity; (2) Granger noncausality; (3) strong 
exogeneity. 


Definition 2.1 (Weak Exogeneity): $Z$ is weakly exogenous for $\lambda$ if (i) and (ii) 
hold. 


If the marginal model parameters are uninformative for inference on $\lambda$, then inference 
on $\lambda$ can proceed on the basis of the conditional distribution $f(Y|Z, \varphi_1)$ alone. 
The operational implication is that weakly exogenous variables can be taken as given 
if one’s main interest is in inference on $\lambda$ or $\varphi_1$. This does not mean that there is no 
statistical model for $Z$; it means that the parameters of that model play no role in the 
inference on $\varphi_1$, and hence are irrelevant. 
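
A standard illustration, added here as a sketch and not taken from the text, is the bivariate normal case. Suppose the scalar pair $(y, z)$ is jointly normal with means $\mu_y, \mu_z$, variances $\sigma_{yy}, \sigma_{zz}$, and covariance $\sigma_{yz}$ (all of these symbols are assumptions introduced for the illustration). The joint density then factors into a conditional and a marginal part as follows:

```latex
\begin{aligned}
f_J(y, z \mid \varphi) &= f_C(y \mid z, \varphi_1) \times f_M(z \mid \varphi_2), \\
y \mid z &\sim N[\alpha + \beta z,\ \sigma^2], \qquad z \sim N[\mu_z,\ \sigma_{zz}], \\
\beta &= \sigma_{yz}/\sigma_{zz}, \qquad
\alpha = \mu_y - \beta\mu_z, \qquad
\sigma^2 = \sigma_{yy} - \sigma_{yz}^2/\sigma_{zz}, \\
\varphi_1 &= (\alpha, \beta, \sigma^2), \qquad \varphi_2 = (\mu_z, \sigma_{zz}).
\end{aligned}
```

If the parameter of interest is $\lambda = \beta$ and nothing links $\varphi_1$ to $\varphi_2$, then conditions (i) and (ii) hold and $z$ is weakly exogenous for $\beta$. If instead a cross-restriction were imposed, for example the hypothetical restriction $\beta = \mu_z$, the marginal density of $z$ would carry information about $\beta$ and condition (ii) would fail.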
2.3.1. Conditional Independence 


Originally, the Granger causality concept was defined in the context of prediction in a 
time-series environment. More generally, it can be interpreted as a form of conditional 
independence (Holland, 1986, p. 957). 

Partition $z$ into two subsets $z_1$ and $z_2$; let $W = [y, z_1, z_2]$ be the matrices of variables 
of interest. Then $z_1$ and $y$ are conditionally independent given $z_2$ if 

$$f(y|z_1, z_2) = f(y|z_2). \tag{2.8}$$

This is stronger than the mean independence assumption, which would imply 

$$E[y|z_1, z_2] = E[y|z_2]. \tag{2.9}$$

Then $z_1$ has no predictive value for $y$, after conditioning on $z_2$. In a predictive sense 
this means that $z_1$ does not Granger-cause $y$. 

In a time-series context, $z_1$ and $z_2$ would be mutually exclusive lagged values of 
subsets of $y$. 
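
The gap between (2.8) and (2.9) can be seen in a small simulation. The sketch below uses an assumed data-generating process, $y = z_2 + (1 + z_1)e$ with binary $z_1, z_2$ and standard normal $e$, chosen only for illustration: $z_1$ leaves the conditional mean of $y$ unchanged but doubles its conditional standard deviation, so mean independence holds while conditional independence fails.

```python
import random
import statistics

random.seed(0)

def draw(n):
    """Simulate y = z2 + (1 + z1) * e with binary z1, z2 and e ~ N(0, 1)."""
    rows = []
    for _ in range(n):
        z1 = random.choice([0, 1])
        z2 = random.choice([0, 1])
        e = random.gauss(0.0, 1.0)
        rows.append((z2 + (1 + z1) * e, z1, z2))
    return rows

data = draw(200_000)

def cell(z1, z2):
    """Values of y for observations with the given (z1, z2)."""
    return [y for y, a, b in data if a == z1 and b == z2]

# Hold z2 = 0 fixed and compare the two z1 cells.
mean_gap = statistics.fmean(cell(1, 0)) - statistics.fmean(cell(0, 0))
sd_ratio = statistics.pstdev(cell(1, 0)) / statistics.pstdev(cell(0, 0))
print(round(mean_gap, 2), round(sd_ratio, 2))
```

In this design the mean gap should be close to zero and the ratio of standard deviations close to two, so a test based only on conditional means would not detect the dependence of $y$ on $z_1$.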


Definition 2.2 (Strong Exogeneity): $z_1$ is strongly exogenous for $\varphi$ if it is 
weakly exogenous for $\varphi$ and does not Granger-cause $y$, so that (2.8) holds. 


2.3.2. Exogenizing Variables 


Exogeneity is a strong assumption. It is a property of random variables relative to 
parameters of interest. Hence a variable may be validly treated as exogenous in one 
structural model but not in another; the key issue is the parameters that are the subject 
of inference. Arbitrary imposition of this property will have some undesirable conse- 
quences that will be discussed in Section 2.4. 

The exogeneity assumption may be justified by a priori theorizing, in which case it 
is a part of the maintained hypothesis of the model. It may in some cases be justified 
as a valid approximation, in which case it may be subject to testing, as discussed in 
Section 8.4.3. In cross-section analysis it may be justified as being a consequence of 
a natural experiment or a quasi-experiment in which the value of the variable is de- 
termined by an external intervention; for example, government or regulatory authority 
may determine the setting of a tax rate or a policy parameter. Of special interest is the 
case in which an external intervention results in a change in the value of an impor- 
tant policy variable. Such a natural experiment is tantamount to exogenization of some 
variable. As we shall see in Chapter 3, this creates a quasi-experimental opportunity to 
study the impact of a variable in the absence of other complicating factors. 


2.4. Linear Simultaneous Equations Model 


An important special case of the general structural model specified in (2.1) is the linear 
simultaneous equation model developed by the Cowles Commission econometricians. 
Comprehensive treatment of this model is available in many textbooks (e.g., Sargan, 
1988). The treatment here is brief and selective; also see Section 6.9.6. The objective is 
to bring into the discussion several key ideas and concepts that have more general rele- 
vance. Although the analysis is restricted to linear models, many insights are routinely 
applied to nonlinear models. 


2.4.1. The SEM Setup 
The linear simultaneous equations model (SEM) setup is as follows: 


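$$\begin{aligned}
y_{1i}\beta_{11} + \cdots + y_{Gi}\beta_{1G} + z_{1i}\gamma_{11} + \cdots + z_{Ki}\gamma_{1K} &= u_{1i} \\
&\;\;\vdots \\
y_{1i}\beta_{G1} + \cdots + y_{Gi}\beta_{GG} + z_{1i}\gamma_{G1} + \cdots + z_{Ki}\gamma_{GK} &= u_{Gi},
\end{aligned}$$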


where i is the observation subscript. 

A clear a priori distinction or preordering is made between endogenous variables, 
$y_i = (y_{1i}, \ldots, y_{Gi})$, and exogenous variables, $z_i = (z_{1i}, \ldots, z_{Ki})$. By definition the 
exogenous variables are uncorrelated with the purely random disturbances $(u_{1i}, \ldots, u_{Gi})$. 
In its unrestricted form every variable enters every equation. 

In matrix notation, the G-equation SEM for the ith observation is written as 

$$y_i'B + z_i'\Gamma = u_i', \tag{2.10}$$

where $y_i$, $B$, $z_i$, $\Gamma$, and $u_i$ have dimensions G × 1, G × G, K × 1, K × G, and G × 1, 
respectively. For specified values of $(B, \Gamma)$ and $(z_i, u_i)$ the G linear simultaneous 
equations can in principle be solved for $y_i$. 

The standard assumptions of SEM are as follows: 


1. $B$ is nonsingular and has rank G. 
2. rank[$Z$] = K. The N × K matrix $Z$ is formed by stacking $z_i'$, $i = 1, \ldots, N$. 
3. plim $N^{-1}Z'Z = \Sigma_{zz}$ is a symmetric K × K positive definite matrix. 
4. $u_i \sim N[0, \Sigma]$; that is, $E[u_i] = 0$ and $E[u_i u_i'] = \Sigma = [\sigma_{ij}]$, where $\Sigma$ is a symmetric 
G × G positive definite matrix. 
5. The errors in each equation are serially independent. 

In this model the structure (or structural parameters) consists of $(B, \Gamma, \Sigma)$. Writing 

$$Y = \begin{bmatrix} y_1' \\ \vdots \\ y_N' \end{bmatrix}, \qquad
Z = \begin{bmatrix} z_1' \\ \vdots \\ z_N' \end{bmatrix}, \qquad
U = \begin{bmatrix} u_1' \\ \vdots \\ u_N' \end{bmatrix}$$

allows us to express the structural model more compactly as 

$$YB + Z\Gamma = U, \tag{2.11}$$


where the arrays $Y$, $B$, $Z$, $\Gamma$, and $U$ have dimensions N × G, G × G, N × K, K × G, 
and N × G, respectively. Solving for all the endogenous variables in terms of all 
the exogenous variables, we obtain the reduced form of the SEM: 

$$\begin{aligned}
Y + Z\Gamma B^{-1} &= UB^{-1}, \\
Y &= Z\Pi + V,
\end{aligned} \tag{2.12}$$

where $\Pi = -\Gamma B^{-1}$ and $V = UB^{-1}$. Given Assumption 4, $v_i \sim N[0, (B^{-1})'\Sigma B^{-1}]$. 
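
The mapping from structural to reduced form parameters can be checked numerically. The following is a minimal sketch for G = 2 and K = 2; the parameter values and the helper functions `matmul`, `inv2`, and `transpose` are hypothetical and exist only for this illustration.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def inv2(m):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def transpose(m):
    return [list(r) for r in zip(*m)]

# Hypothetical structural parameters (G = 2 equations, K = 2 exogenous variables).
B = [[1.0, -0.5], [-0.4, 1.0]]        # G x G, nonsingular
Gamma = [[-1.0, 0.0], [0.0, -2.0]]    # K x G
Sigma = [[1.0, 0.3], [0.3, 2.0]]      # G x G covariance of structural errors

B_inv = inv2(B)
Pi = [[-x for x in row] for row in matmul(Gamma, B_inv)]    # Pi = -Gamma B^{-1}
Omega = matmul(matmul(transpose(B_inv), Sigma), B_inv)      # (B^{-1})' Sigma B^{-1}

# Consistency check: Pi B + Gamma should be a matrix of zeros.
check = matmul(Pi, B)
residual = max(abs(check[i][j] + Gamma[i][j]) for i in range(2) for j in range(2))
print(Pi, residual)
```

Each element of `Pi` mixes several structural coefficients, which is the sense in which unrestricted reduced form parameters are harder to interpret than the elements of $B$ and $\Gamma$.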

In the SEM framework the structural model has primacy for several reasons. First, 
the equations themselves have interpretations as economic relationships such as de- 
mand or supply relations, production functions, and so forth, and they are subject to 
restrictions of economic theory. Consequently, $B$ and $\Gamma$ are parameters that describe 
economic behavior. Hence a priori theory can be invoked to form expectations about 
the sign and size of individual coefficients. By contrast, the unrestricted reduced form 
parameters are potentially complicated functions of the structural parameters, and as 
such it may be difficult to evaluate them postestimation. This consideration may have 
little weight if the goal of econometric modeling is prediction rather than inference on 
parameters with behavioral interpretation. 

Consider, without loss of generality, the first equation in the model (2.11), with $y_1$ 
as the dependent variable. In addition, some of the remaining G − 1 endogenous variables 
and K − 1 exogenous variables may be absent from this equation. From (2.12) 
we see that in general the endogenous variables $Y$ depend stochastically on $V$, which 
in turn is a function of the structural errors $U$. Therefore, in general plim $N^{-1}Y'U \neq 0$. 
Generally, the application of the least-squares estimator in the simultaneous equation 
setting yields inconsistent estimates. This is a well-known and basic result from the si- 
multaneous equations literature, often referred to as the “simultaneous equations bias” 
problem. The vast literature on simultaneous equations models deals with identifica- 
tion and consistent estimation when the least-squares approach fails; see Sargan (1988) 
and Schmidt (1976), and Section 6.9.6. 

The reduced form of SEM expresses every endogenous variable as a linear function 
of all exogenous variables and all structural disturbances. The reduced form distur- 
bances are linear combinations of the structural disturbances. From the reduced form 
for the ith observation 


$$E[y_i|z_i] = \Pi'z_i, \tag{2.13}$$
$$V[y_i|z_i] = \Omega \equiv (B^{-1})'\Sigma B^{-1}. \tag{2.14}$$


The reduced form parameters $\Pi$ are derived parameters defined as functions of the 
structural parameters. If $\Pi$ can be consistently estimated then the reduced form can 
be used to make predictive statements about variations in $Y$ due to exogenous changes 
in $Z$. This is possible even if $B$ and $\Gamma$ are not known. Given the exogeneity of $Z$, 
the full set of reduced form regressions is a multivariate regression model that can be 
estimated consistently by least squares. The reduced form provides a basis for making 
conditional predictions of Y given Z. 

The restricted reduced form is the unrestricted reduced form model subject to re- 
strictions. If these are the same restrictions as those that apply to the structure, then 
structural information can be recovered from the reduced form. 

In the SEM framework, the unknown structural parameters, the nonzero elements 
of $B$, $\Gamma$, and $\Sigma$, play a key role because they reflect the causal structure of the 
model. The interdependence between endogenous variables is described by $B$, and 
the responses of endogenous variables to exogenous shocks in $Z$ are reflected in the 
parameter matrix $\Gamma$. In this setup the causal parameters of interest are those that 
measure the direct marginal impact of a change in an explanatory variable, $y_j$ or 
$z_k$, on the outcome of interest $y_l$, $l \neq j$, and functions of such parameters and data. 
The elements of $\Sigma$ describe the dispersion and dependence properties of the random 
disturbances, and hence they measure some properties of the way the data are 
generated. 


2.4.2. Causal Interpretation in SEM 


A simple example will illustrate the causal interpretation of parameters in SEM. The 
structural model has two continuous endogenous variables $y_1$ and $y_2$, a single continuous 
exogenous variable $z_1$, one stochastic relationship linking $y_1$ and $y_2$, and one 
definitional identity linking all three variables in the model: 

$$\begin{aligned}
y_1 &= \gamma_1 + \beta_1 y_2 + u_1, \qquad 0 < \beta_1 < 1, \\
y_2 &= y_1 + z_1.
\end{aligned}$$

In this model $u_1$ is a stochastic disturbance, independent of $z_1$, with a well-defined 
distribution. The parameter $\beta_1$ is subject to an inequality constraint that is also a part 
of the model specification. The variable $z_1$ is exogenous and therefore its variation is 
induced by external sources that we may regard as interventions. These interventions 
have a direct impact on $y_2$ through the identity and also an indirect one through the 
first equation. The impact is measured by the reduced form of the model, which is 

$$\begin{aligned}
y_1 &= \frac{\gamma_1}{1-\beta_1} + \frac{\beta_1}{1-\beta_1} z_1 + \frac{1}{1-\beta_1} u_1 = E[y_1|z_1] + v_1, \\
y_2 &= \frac{\gamma_1}{1-\beta_1} + \frac{1}{1-\beta_1} z_1 + \frac{1}{1-\beta_1} u_1 = E[y_2|z_1] + v_1,
\end{aligned}$$

where $v_1 = u_1/(1 - \beta_1)$. The reduced form coefficients $\beta_1/(1 - \beta_1)$ and $1/(1 - \beta_1)$ 
have a causal interpretation. Any externally induced unit change in $z_1$ will cause the 
value of $y_1$ and $y_2$ to change by these amounts. Note that in this model $y_1$ and $y_2$ also 
respond to $u_1$. In order not to confound the impact of the two sources of variation we 
require that $z_1$ and $u_1$ are independent. 

Also note that 

$$\frac{\partial y_1}{\partial y_2} = \frac{\beta_1}{1-\beta_1} \div \frac{1}{1-\beta_1}
= \frac{\partial y_1}{\partial z_1} \div \frac{\partial y_2}{\partial z_1}.$$

In what sense does $\beta_1$ measure the causal effect of $y_2$ on $y_1$? To see a possible difficulty, 
observe that $y_1$ and $y_2$ are interdependent or jointly determined, so it is unclear 
in what sense $y_2$ “causes” $y_1$. Although $z_1$ (and $u_1$) is the ultimate cause of changes 
in the reduced form sense, $y_2$ is a proximate or an intermediate cause of $y_1$. That is, 
the first structural equation provides a snapshot of the impact of $y_2$ on $y_1$, whereas 
the reduced form gives the (equilibrium) impact after allowing for all interactions between 
the endogenous variables to work themselves out. In a SEM framework even 
endogenous variables are viewed as causal variables, and their coefficients as causal 
parameters. This approach can cause puzzlement for those who view causality in an 
experimental setting where independent sources of variation are the causal variables. 
The SEM approach makes sense if $y_2$ has an independent and exogenous source of 
variation, which in this model is $z_1$. Hence the marginal response coefficient $\beta_1$ is a 
function of how $y_1$ and $y_2$ respond to a change in $z_1$, as the immediately preceding 
equation makes clear. 
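
The two-equation example can be simulated to see these relationships at work. The sketch below assumes illustrative values $\gamma_1 = 1$ and $\beta_1 = 0.5$ and standard normal draws for $z_1$ and $u_1$; none of these choices come from the text. It estimates the two reduced form slopes by least squares, recovers $\beta_1$ as their ratio, and compares this with a direct least-squares regression of $y_1$ on $y_2$.

```python
import random

random.seed(1)
gamma1, beta1 = 1.0, 0.5          # illustrative values with 0 < beta1 < 1
n = 100_000

z1 = [random.gauss(0.0, 1.0) for _ in range(n)]
u1 = [random.gauss(0.0, 1.0) for _ in range(n)]    # drawn independently of z1
# Reduced form: solve the two structural equations for y1, then apply the identity.
y1 = [(gamma1 + beta1 * z + u) / (1 - beta1) for z, u in zip(z1, u1)]
y2 = [a + z for a, z in zip(y1, z1)]

def slope(y, x):
    """OLS slope from regressing y on x with an intercept."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

pi1 = slope(y1, z1)     # estimates beta1 / (1 - beta1)
pi2 = slope(y2, z1)     # estimates 1 / (1 - beta1)
ratio = pi1 / pi2       # ratio of reduced form slopes, estimates beta1
ols = slope(y1, y2)     # direct OLS of y1 on y2, which ignores simultaneity
print(round(pi1, 2), round(pi2, 2), round(ratio, 2), round(ols, 2))
```

With these assumed values the reduced form slopes should be near 1 and 2 and their ratio near 0.5, whereas the direct regression converges to 0.75 rather than 0.5 because $y_2$ is correlated with $u_1$.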

Of course this model is but a special case. More generally, we may ask under what 
conditions will the SEM parameters have a meaningful causal interpretation. We return 
to this issue when discussing identification concepts in Section 2.5. 


2.4.3. Extensions to Nonlinear and Latent Variable Models 


If the simultaneous model is nonlinear in parameters only, the structural model can 
be written as 


$$YB(\theta) + Z\Gamma(\theta) = U, \tag{2.15}$$


where $B(\theta)$ and $\Gamma(\theta)$ are matrices whose elements are functions of the structural 
parameters $\theta$. An explicit reduced form can be derived as before. 

If nonlinearity is instead in variables then an explicit (analytical) reduced form 
may not be possible, although linearized approximations or numerical solutions of the 
dependent variables, given (z, u), can usually be obtained. 

Many microeconometric models involve latent or unobserved variables as well as 
observed endogenous variables. For example, search and auction theory models use the 
concept of reservation wage or reservation price, choice models invoke indirect utility, 
and so forth. In the case of such models the structural model (2.1) may be replaced by 


$$g(y_i^*, z_i, u_i|\theta) = 0, \tag{2.16}$$

where the latent variables $y_i^*$ replace the observed variables $y_i$. The corresponding 
reduced form solves for $y_i^*$ in terms of $(z_i, u_i)$, yielding 

$$y_i^* = f(z_i, u_i|\pi). \tag{2.17}$$

This reduced form has limited usefulness as $y_i^*$ is not fully observed. However, if we 
have functions $y_i = h(y_i^*)$ that relate observable with latent counterparts of $y_i$, then the 
reduced form in terms of observables is 

$$y_i = h(f(z_i, u_i|\pi)). \tag{2.18}$$

See Section 16.8.2 for further details. 
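
As a concrete sketch of (2.17) and (2.18), suppose the latent reduced form is linear and the observation rule censors the latent variable at zero. Both choices, along with the parameter values `pi0` and `pi1`, are assumptions made for illustration and are not implied by the text.

```python
import random

random.seed(2)
pi0, pi1 = -0.5, 1.0    # hypothetical reduced form parameters

def f(z, u):
    """Latent reduced form y* = f(z, u | pi); linear here by assumption."""
    return pi0 + pi1 * z + u

def h(y_star):
    """Observation rule linking latent and observed y; censoring at zero by assumption."""
    return max(0.0, y_star)

sample = []
for _ in range(10_000):
    z, u = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
    y_star = f(z, u)
    sample.append((z, y_star, h(y_star)))   # only (z, h(y*)) would be observed in practice

share_censored = sum(1 for _, _, y in sample if y == 0.0) / len(sample)
print(round(share_censored, 2))
```

In this design well over half of the observations are censored, so the observed $y_i$ reveals $y_i^*$ only for the remainder, which is why the latent reduced form by itself is of limited use.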
When the structural model involves nonlinearities in variables, or when latent variables 
are involved, an explicit derivation of the functional form of this reduced form 
may be difficult to obtain. In such cases practitioners use approximations. By citing 
mathematical or computational convenience, a specific functional form may be used 
to relate an endogenous variable to all exogenous variables, and the result would be 
referred to as a “reduced form type relationship.” 


2.4.4. Interpretations of Structural Relationships 


Marschak (1953, p. 26) in an influential essay gave the following definition of a 
structure: 


Structure was defined as a set of conditions which did not change while observations 
were being made but which might change in future. If a specified change of struc- 
ture is expected or intended, prediction of variables of interest to the policy maker 
requires some knowledge of past structure... . In economics, the conditions that con- 
stitute a structure are (1) a set of relations describing human behavior and institutions 
as well as technological laws and involving, in general, nonobservable random dis- 
turbances and nonobservable random errors of measurement; (2) the joint probability 
distribution of these random quantities. 


Marschak argued that the structure was fundamental for a quantitative evaluation or 
tests of economic theory and that the choice of the best policy requires knowledge of 
the structure. 

In the SEM literature a structural model refers to “autonomous” (not “derived”) 
relationships. There are other closely related concepts of a structure. One such concept 
refers to “deep parameters,” by which is meant technology and preference parameters 
that are invariant to interventions. 

In recent years an alternative usage of the term structure has emerged, one that refers 
to econometric models based on the hypothesis of dynamic stochastic optimization by 
rational agents. In this approach the starting point for any structural estimation prob- 
lem is the first-order necessary conditions that define the agent’s optimizing behavior. 
For example, in a standard problem of maximizing utility subject to constraints, the 
behavioral relations are the deterministic first-order marginal utility conditions. If the 
relevant functional forms are explicitly stated, and stochastic errors of optimization are 
introduced, then the first-order conditions define a behavioral model whose parameters 
characterize the utility function — the so-called deep or policy-invariant parameters. 
Examples are given in Sections 6.2.7 and 16.8.1. 

Two features of this highly structural approach should be mentioned. First, it 
relies on a priori economic theory in a serious manner. Economic theory is not used 
simply to generate a list of relevant variables that one can use in a more or less arbi- 
trarily specified functional form. Rather, the underlying economic theory has a major 
(but not exclusive) role in specification, estimation, and inference. The second feature 
is that identification, specification, and estimation of the resulting model can be very 
complicated, because the agent’s optimization problem is potentially very complex, 
especially if dynamic optimization under uncertainty is postulated and discreteness 
and discontinuities are present; see Rust (1994). 


2.5. Identification Concepts 


The goal of the SEM approach is to consistently estimate $(B, \Gamma, \Sigma)$ and conduct statistical 
inference. An important precondition for consistent estimation is that the model 
should be identified. We briefly discuss the important twin concepts of observational 
equivalence and identifiability in the context of parametric models. 

Identification is concerned with determination of a parameter given sufficient ob- 
servations. In this sense, it is an asymptotic concept. Statistical uncertainty necessarily 
affects any inference based on a finite number of observations. By hypothetically 
considering the possibility that a sufficient number of observations is available, one 
can ask whether it is logically possible to determine a parameter of interest 
either in the sense of its point value or in the sense of determining the set of which 
the parameter is an element. Therefore, identification is a fundamental consideration 
and logically occurs prior to and is separate from statistical estimation. A great deal of 
econometric literature on identification focuses on point identification. This is also the 
emphasis of this section. However, set identification, or bounds identification, is an 
important approach that will be used in selected places in this book (e.g., Chapters 25 
and 27; see Manski, 1995). 


Definition 2.3 (Observational Equivalence): Two structures of a model defined 
as joint probability distribution function $\Pr[x|\theta]$, $x \in W$, $\theta \in \Theta$, are observationally 
equivalent if $\Pr[x|\theta^1] = \Pr[x|\theta^2]$ $\forall\, x \in W$. 


Less formally, if, given the data, two structural models imply identical joint proba- 
bility distributions of the variables, then the two structures are observationally equiva- 
lent. The existence of multiple observationally equivalent structures implies the failure 
of identification. 


Definition 2.4 (Identification): A structure $\theta^0$ is identified if there is no other 
observationally equivalent structure in $\Theta$. 


A simple example of nonidentification occurs when there is perfect collinearity between 
regressors in the linear regression $y = X\beta + u$. Then we can identify the linear 
combination $C\beta$, where rank[$C$] < dim[$\beta$], but we cannot identify $\beta$ itself. 
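
A small numerical sketch makes the point. The design matrix below is hypothetical and has its second regressor equal to twice the first, so two different coefficient vectors produce identical fitted values while sharing the same value of the combination $\beta_1 + 2\beta_2$.

```python
# Hypothetical design with x2 = 2 * x1 in every row (perfect collinearity).
X = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (-1.0, -2.0)]

def fitted(beta):
    """Systematic part X beta of the regression for a given coefficient vector."""
    return [x1 * beta[0] + x2 * beta[1] for x1, x2 in X]

beta_a = (1.0, 1.0)
beta_b = (3.0, 0.0)     # a different vector with the same beta1 + 2 * beta2

same_fit = fitted(beta_a) == fitted(beta_b)
combo_a = beta_a[0] + 2 * beta_a[1]
combo_b = beta_b[0] + 2 * beta_b[1]
print(same_fit, combo_a, combo_b)
```

Here $C = [1 \;\; 2]$ has rank one while $\beta$ has two elements: the data can pin down $C\beta$ but cannot distinguish `beta_a` from `beta_b`.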

This definition concerns uniqueness of the structure. In the context of the SEM 
we have given, this definition means that identification requires that there is a unique 
triple $(B, \Gamma, \Sigma)$ consistent with the observed data. In SEM, as in other cases, identification 
involves being able to obtain unique estimates of structural parameters given 
the sample moments of the data. For example, in the case of the reduced form (2.12), 
under the stated assumptions the least-squares estimator provides unique estimates of 
$\Pi$, that is, $\hat{\Pi} = [Z'Z]^{-1}Z'Y$, and identification of $B$, $\Gamma$ requires that there is a solution 
for the unknown elements of $\Gamma$ and $B$ from the equations $\Pi + \Gamma B^{-1} = 0$, given a 
priori restrictions on the model. A unique solution implies just identification of the 
model. 

A complete model is said to be identified if all the model parameters are identified. 
It is possible that for some models only a subset of parameters is identified. In some 
situations it may be important to be able to identify some function of parameters, and 
not necessarily all the individual parameters. Identification of a function of parameters 
means that the function can be recovered uniquely from $F(W|\theta)$. 

How does one ensure that the structures of alternative model specifications can be 
“ruled out”? In SEM the solution to this problem depends on augmenting the sample 
information by a priori restrictions on $(B, \Gamma, \Sigma)$. The a priori restrictions must introduce 
sufficient additional information into the model to rule out the existence of other 
observationally equivalent structures. 

The need for a priori restrictions is demonstrated by the following argument. First 
note that given the assumptions of Section 2.4.1 the reduced form, defined by $(\Pi, \Omega)$, 
is always unique. Initially suppose there are no restrictions on $(B, \Gamma, \Sigma)$. Next suppose 
that there are two observationally equivalent structures $(B_1, \Gamma_1, \Sigma_1)$ and $(B_2, \Gamma_2, \Sigma_2)$. 
Then 

$$\Pi = -\Gamma_1 B_1^{-1} = -\Gamma_2 B_2^{-1}. \tag{2.19}$$

Let $H$ be a G × G nonsingular matrix. Then $\Gamma_1 B_1^{-1} = \Gamma_1 H H^{-1} B_1^{-1} = \Gamma_2 B_2^{-1}$, which 
means that $\Gamma_2 = \Gamma_1 H$, $B_2 = B_1 H$. Thus the second structure is a linear transformation 
of the first. 
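
The argument can be verified numerically. In the sketch below the matrices $B_1$, $\Gamma_1$, and $H$ are hypothetical 2 × 2 examples, and `matmul`, `inv2`, and `reduced_form` are helper names introduced only for this illustration; postmultiplying both structural matrices by $H$ leaves $\Pi$ unchanged.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def inv2(m):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def reduced_form(B, Gamma):
    """Pi = -Gamma B^{-1}."""
    return [[-x for x in row] for row in matmul(Gamma, inv2(B))]

B1 = [[1.0, -0.5], [-0.4, 1.0]]
Gamma1 = [[-1.0, 0.3], [0.2, -2.0]]
H = [[2.0, 1.0], [0.0, 1.0]]        # any nonsingular G x G matrix would do

B2 = matmul(B1, H)                  # second structure: B2 = B1 H
Gamma2 = matmul(Gamma1, H)          # and Gamma2 = Gamma1 H

Pi1 = reduced_form(B1, Gamma1)
Pi2 = reduced_form(B2, Gamma2)
max_gap = max(abs(a - b) for r1, r2 in zip(Pi1, Pi2) for a, b in zip(r1, r2))
print(max_gap)
```

The two structures differ element by element yet imply the same $\Pi$ up to rounding error, so the reduced form alone cannot tell them apart.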

The SEM solution to this problem is to introduce restrictions on $(B, \Gamma, \Sigma)$ such 
that we can rule out the existence of linear transformations that lead to observationally 
equivalent structures. In other words, the restrictions on $(B, \Gamma, \Sigma)$ must be such 
that there is no matrix $H$ that would yield another structure with the same reduced 
form; given $(\Pi, \Omega)$ there will be a unique solution to the equations $\Pi = -\Gamma B^{-1}$ and 
$\Omega = (B^{-1})'\Sigma B^{-1}$. 

In practice a variety of restrictions can be imposed including (1) normalizations, 
such as setting diagonal elements of B equal to 1, (2) zero (exclusion) and linear ho- 
mogeneous and inhomogeneous restrictions, and (3) covariance and inequality restric- 
tions. Details of the necessary and sufficient conditions for identification in linear and 
nonlinear models can be found in many texts including Sargan (1988). 

Meaningful imposition of identifying restrictions requires that the a priori restric- 
tions imposed should be valid a posteriori. This idea is pursued further in several chap- 
ters where identification issues are considered (e.g., Section 6.9). 

Exclusion restrictions essentially state that the model contains some variables that 
have zero impact on some endogenous variables. That is, certain directions of causa- 
tion are ruled out a priori. This makes it possible to identify other directions of cau- 
sation. For example, in the simple two-variable example given earlier, zı did not enter 
the yı-equation, making it possible to identify the direct impact of y2 on yı. Although 
exclusion restrictions are the simplest to apply, in parametric models identification can 
also be secured by inequality restrictions and covariance restrictions. 

If there are no restrictions on $\Sigma$, and the diagonal elements of $B$ are normalized to 
1, then a necessary condition for identification is the order condition, which states 
that the number of excluded exogenous variables must at least equal the number of 
included endogenous variables. A sufficient condition is the rank condition given in 
many texts, which ensures that for the jth equation the system $\Pi B_j = -\Gamma_j$, where $B_j$ and 
$\Gamma_j$ denote the jth columns of $B$ and $\Gamma$, yields a unique solution for $(\Gamma_j, B_j)$ given $\Pi$. 

Given identification, the term just (exact) identification refers to the case when 
the order condition is exactly satisfied; overidentification refers to the case when the 
number of restrictions on the system exceeds that required for exact identification. 
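
The counting rule behind the order condition is easy to mechanize. The sketch below uses a hypothetical helper, `order_condition`, and reads "included endogenous variables" as the endogenous variables on the right-hand side of the equation after normalization; that reading is an assumption. Because the order condition is only necessary, the function reports whether the count is satisfied and does not claim identification.

```python
def order_condition(n_excluded_exogenous: int, n_rhs_endogenous: int) -> str:
    """Classify one equation by the order condition (necessary, not sufficient).

    n_excluded_exogenous: exogenous variables in the system absent from this equation.
    n_rhs_endogenous: endogenous variables on the right-hand side (assumed reading).
    """
    if n_excluded_exogenous < n_rhs_endogenous:
        return "fails order condition"
    if n_excluded_exogenous == n_rhs_endogenous:
        return "order condition holds exactly"
    return "order condition holds with surplus restrictions"

# First equation of the two-variable example in Section 2.4.2:
# z1 is excluded (one exclusion) and y2 is the only right-hand-side endogenous variable.
example = order_condition(1, 1)
print(example)
```

Given that the rank condition also holds, the first outcome category with equality corresponds to just identification and the surplus case to overidentification.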

Identification in nonlinear SEM has been discussed in Sargan (1988), who also 
gives references to earlier related work. 


2.6. Single-Equation Models 


Without loss of generality consider the first equation of a linear SEM subject to 
normalization $\beta_{11} = 1$. Let $y = y_1$, let $\mathbf{y}_1$ denote the endogenous components of $\mathbf{y}$ other 
than $y_1$, and let $\mathbf{z}_1$ denote the exogenous components of $\mathbf{z}$, with 

$$y = \mathbf{y}_1'\alpha + \mathbf{z}_1'\gamma + u. \tag{2.20}$$


Many studies skip the formal steps involved in going from a system to a single equation 
and begin by writing the regression equation 


$$y = \mathbf{x}'\beta + u,$$


where some components of $\mathbf{x}$ are endogenous (implicitly $\mathbf{y}_1$) and others are exogenous 
(implicitly $\mathbf{z}_1$). The focus lies then on estimating the impact of changes in key 
regressor(s) that may be endogenous or exogenous, depending on the assumptions. 
Instrumental variable or two-stage least-squares estimation is the most obvious estimation 
strategy (see Sections 4.8, 6.4, and 6.5). 
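
A minimal simulation sketch of this strategy follows. The data-generating process is hypothetical: a single endogenous regressor `x`, a single instrument `z` that is independent of the error, and a true coefficient of 2. With one instrument and one endogenous regressor, two-stage least squares reduces to the simple instrumental variable ratio used here.

```python
import random

random.seed(3)
n = 100_000
beta = 2.0                                            # hypothetical structural coefficient

z = [random.gauss(0.0, 1.0) for _ in range(n)]        # instrument, independent of u
v = [random.gauss(0.0, 1.0) for _ in range(n)]
u = [0.8 * vi + random.gauss(0.0, 1.0) for vi in v]   # error correlated with x through v
x = [zi + vi for zi, vi in zip(z, v)]                 # endogenous regressor
y = [1.0 + beta * xi + ui for xi, ui in zip(x, u)]

def cov(a, b):
    """Sample covariance (divisor n)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)

b_ols = cov(x, y) / cov(x, x)    # inconsistent because x is correlated with u
b_iv = cov(z, y) / cov(z, x)     # simple IV estimator
print(round(b_ols, 2), round(b_iv, 2))
```

In this design least squares converges to 2.4 rather than 2, while the instrumental variable estimate should be close to the true value.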

In the SEM approach it is natural to specify at least some of the remaining equations 
in the model, even if they are not the focus of inquiry. Suppose $\mathbf{y}_1$ has dimension 
1. Then the first possibility is to specify the structural equation for $y_1$ and for 
the other endogenous variables that may appear in this structural equation for $y_1$. 
A second possibility is to specify the reduced form equation for $\mathbf{y}_1$. This will show 
exogenous variables that affect $\mathbf{y}_1$ but do not directly affect $y$. An advantage is that 
in such a setting instrumental variables emerge naturally. However, in recent empirical 
work using instrumental variables in a single-equation setting, even the formal 
step of writing down a reduced form for the endogenous right-hand-side variable is 
avoided. 


2.7. Potential Outcome Model 


Motivation for causal inference in econometric models is especially strong when the 
focus is on the impact of public policy and/or private decision variables on some 
specific outcomes. Specific examples include the impact of transfer payments on labor 
supply, the impact of class size on student learning, and the impact of health insurance 
on utilization of health care. In many cases the causal variables themselves reflect 
individual decisions and hence are potentially endogenous. When, as is usually the 
case, econometric estimation and inference are based on observational data, iden- 
tification of and inference on causal parameters pose many challenges. These chal- 
lenges can become potentially less serious if the causal issues are addressed using 
data from a controlled social experiment with a proper statistical design. Although 
such experiments have been implemented (see Section 3.3 for examples and details) 
they are generally expensive to organize and run. Therefore, it is more attractive 
to implement causal modeling using data generated by a natural experiment or in 
a quasi-experimental setting. Section 3.4 discusses the pros and cons of these data 
structures; but for present purposes one should think of a natural or quasi experi- 
ment as a setting in which some causal variable changes exogenously and indepen- 
dently of other explanatory variables, making it relatively easier to identify causal 
parameters. 

A major obstacle for causality modeling stems from the fundamental problem of 
causal inference (Holland, 1986). Let X be the hypothesized cause and Y the outcome. 
By manipulating the value of X we can change the value of Y. Suppose the value of X 
is changed from $x_1$ to $x_2$. Then a measure of the causal impact of the change on Y is 
formed by comparing the two values of Y: $y_2$, which results from the change, and $y_1$, 
which would have resulted had no change in X occurred. However, if X did change, 
then the value of Y, in the absence of the change, would not be observed. Hence noth- 
ing more can be said about causal impact without some hypothesis about what value 
Y would have assumed in the absence of the change in X. The latter is referred to 
as a counterfactual, which means hypothetical unobserved value. Briefly stated, all 
causal inference involves comparison of a factual with a counterfactual outcome. In 
the conventional econometric model (e.g., SEM) a counterfactual does not need to be 
explicitly stated. 

A relatively newer strand in the microeconometric literature — program evalua- 
tion or treatment evaluation — provides a statistical framework for the estimation 
of causal parameters. In the statistical literature this framework is also known as the 
Rubin causal model (RCM) in recognition of a key early contribution by Rubin 
(1974, 1978), who in turn cites R.A. Fisher as originator of the approach. Al- 
though, following recent convention, we refer to this as the Rubin causal model, 
Neyman (Splawa-Neyman) also proposed a similar statistical model in an article 
published in Polish in 1923; see Neyman (1990). Models involving counterfactuals 
have been independently developed in econometrics following the seminal work of 
Roy (1951). In the remainder of this section the salient features of RCM will be 
analyzed. 

Causal parameters based on counterfactuals provide statistically meaningful and 
operational definitions of causality that in some respects differ from the traditional 
Cowles foundation definition. First, in ideal settings this framework leads to considerable 
simplicity of econometric methods. Second, this framework typically focuses on 
the fewer causal parameters that are thought to be most relevant to policy issues that 
are examined. This contrasts with the traditional econometric approach that focuses 
simultaneously on all structural parameters. Third, the approach provides additional 
insights into the properties of causal parameters estimated by the standard structural 
methods. 


2.7.1. The Rubin Causal Model 


The term “treatment” is used interchangeably with “cause.” In medical studies of new 
drug evaluation, involving groups of those who receive the treatment and those who 
do not, the drug response of the treated is compared with that of the untreated. A mea- 
sure of causal impact is the average difference in the outcomes of the treated and the 
nontreated groups. In economics, the term treatment is used very broadly. Essentially 
it covers variables whose impact on some outcome is the object of study. Examples of 
treatment—outcome pairs include schooling and wages, class size and scholastic per- 
formance, and job training and earnings. Note that a treatment need not be exogenous, 
and in many situations it is an endogenous (choice) variable. 

Within the framework of a potential outcome model (POM), which assumes that 
every element of the target population is potentially exposed to the treatment, the triple 
$(y_{1i}, y_{0i}, D_i)$, $i = 1, \ldots, N$, forms the basis of treatment evaluation. The categorical 
variable $D$ takes the values 1 and 0, respectively, when treatment is or is not received; 
$y_{1i}$ measures the response for individual $i$ receiving treatment, and $y_{0i}$ measures that 
when not receiving treatment. That is, 

$$y_i = \begin{cases} y_{1i} & \text{if } D_i = 1, \\ y_{0i} & \text{if } D_i = 0. \end{cases} \tag{2.21}$$

Since the receipt and nonreceipt of treatment are mutually exclusive states for individual 
$i$, only one of the two measures is available for any given $i$, the unavailable 
measure being the counterfactual. The effect of the cause $D$ on the outcome of individual 
$i$ is measured by $(y_{1i} - y_{0i})$. The average causal effect of $D_i = 1$, relative to $D_i = 0$, 
is measured by the average treatment effect (ATE): 


$$\mathrm{ATE} = E[y|D = 1] - E[y|D = 0], \tag{2.22}$$


where expectations are with respect to the probability distribution over the target pop- 
ulation. Unlike the conventional structural model that emphasizes marginal effects, the 
POM framework emphasizes ATE and parameters related to it. 

The experimental approach to the estimation of ATE-type parameters involves a 
random assignment of treatment followed by a comparison of the outcomes with a 
set of nontreated cases that serve as controls. Such an experimental design is explained 
in greater detail in Chapter 3. Random assignment implies that individuals exposed to 
treatment are chosen randomly, and hence the treatment assignment does not depend 
on the outcome and is uncorrelated with the attributes of treated subjects. Two ma- 
jor simplifications follow. The treatment variable can be treated as exogenous and its 
coefficient in a linear regression will not suffer from omitted variable bias if some 
relevant variables are unavoidably omitted from the regression. Under certain conditions,
discussed at greater length in Chapters 3 and 25, the mean difference between
the outcomes of the treated and the control groups will provide an estimate of ATE. 
The payoff to the well-designed experiment is the relative simplicity with which causal 
statements can be made. Of course, to ensure high statistical precision for the treatment 
effect estimate, one should still control for those attributes that also independently in- 
fluence the outcomes. 

Because random assignment of treatment is generally not feasible in economics, 
estimation of ATE-type parameters must be based on observational data generated 
under nonrandom treatment assignment. Then the consistent estimation of ATE will 
be threatened by several complications that include, for example, possible correlation 
between the outcomes and treatment, omitted variables, and endogeneity of the treat- 
ment variable. Some econometricians have suggested that the absence of randomiza- 
tion comprises the major impediment to convincing statistical inference about causal 
relationships. 
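
The contrast between random and nonrandom assignment can be illustrated with a small simulation. This is a minimal sketch under assumptions of our own choosing: a constant individual effect of 2, an unobserved attribute called `ability` that raises both outcomes, and a self-selection rule in which high-ability units are more likely to take the treatment. None of these design choices come from the text.

```python
import random
import statistics

random.seed(0)
TRUE_EFFECT = 2.0   # assumed constant individual effect y1 - y0
N = 20000


def difference_in_means(randomized):
    """Simulate N units and return mean(y | D=1) - mean(y | D=0)."""
    treated, untreated = [], []
    for _ in range(N):
        ability = random.gauss(0.0, 1.0)               # unobserved attribute
        y0 = 1.0 + ability + random.gauss(0.0, 1.0)    # outcome without treatment
        y1 = y0 + TRUE_EFFECT                          # outcome with treatment
        if randomized:
            d = random.random() < 0.5                  # coin-flip assignment
        else:
            d = ability + random.gauss(0.0, 1.0) > 0   # high-ability units opt in
        if d:
            treated.append(y1)      # only y1 is observed for the treated
        else:
            untreated.append(y0)    # only y0 is observed for the untreated
    return statistics.fmean(treated) - statistics.fmean(untreated)


experimental = difference_in_means(randomized=True)
observational = difference_in_means(randomized=False)
print(f"randomized assignment:    {experimental:.2f}")
print(f"self-selected assignment: {observational:.2f}")
```

Under random assignment the difference in means should be close to the assumed effect. Under self-selection it overstates the effect, because the treated group has higher average ability than the untreated group.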

The potential outcome model can lead to causal statements if the counterfactual can 
be clearly stated and made operational. An explicit statement of the counterfactual, 
with a clear implication of what should be compared, is an important feature of this 
model. If, as may be the case with observational data, there is lack of a clear distinc- 
tion between observed and counterfactual quantities, then the answer to the question 
of who is affected by the treatment remains unclear. ATE is a measure that weights and 
combines marginal responses of specific subpopulations. Specific assumptions are re- 
quired to operationalize the counterfactual. Information on both treated and untreated 
units that can be observed is needed to estimate ATE. For example, it is necessary to 
identify the untreated group that proxies the treated group if the treatment were not 
applied. It is not necessarily true that this step can always be implemented. The exact 
way in which the treated are selected involves issues of sampling design that are also 
discussed in Chapters 3 and 25. 

A second useful feature of the POM is that it identifies opportunities for causal 
modeling created by natural or quasi-experiments. When data are generated in such 
settings, and provided certain other conditions are satisfied, causal modeling can occur 
without the full complexities of the SEM framework. This issue is analyzed further in 
Chapters 3 and 25. 

Third, unlike the structural form of the SEM where all variables other than that be- 
ing explained can be labeled as “causes,” in the POM not all explanatory variables can 
be regarded as causal. Many are simply attributes of the units that must be controlled 
for in regression analysis, and attributes are not causes (Holland, 1986). Causal param- 
eters must relate to variables that are actually or potentially, and directly or indirectly, 
subject to intervention. 

Finally, identifiability of the ATE parameter may be an easier research goal and 
hence feasible in situations where the identifiability of a full SEM may not be (Angrist, 
2001). Whether this is so has to be determined on a case-by-case basis. However,
many available applications of the POM typically employ a limited, rather than full,
information framework. Of course, even within the SEM framework the use of a limited
information framework is also feasible, as was previously discussed.

2.8. Causal Modeling and Estimation Strategies


In this section we briefly sketch some of the ways in which econometricians approach 
the modeling of causal relationships. These approaches can be used within both SEM 
and POM frameworks, but they are typically identified with the former. 


2.8.1. Identification Frameworks


Full-Information Structural Models


One variant of this approach is based on the parametric specification of the joint distri- 
bution of endogenous variables conditional on exogenous variables. The relationships 
are not necessarily derived from an optimizing model of behavior. Parametric restric- 
tions are placed to ensure identification of the model parameters that are the target 
of statistical inference. The entire model is estimated simultaneously using maximum 
likelihood or moments-based estimation. We call this approach the full-information 
structural approach. For well-specified models this is an attractive approach but in 
general its potential limitation is that it may contain some equations that are poorly 
specified. Under joint estimation the effects of localized misspecification may also 
affect other estimates. 

Statistically we may interpret the full-information approach as one in which the 
joint probability distribution of endogenous variables, given the exogenous variables, 
forms the basis of inference about causality. The jointness may derive from contem- 
poraneous or dynamic interdependence between endogenous variables and/or the dis- 
turbances on the equations. 


Limited-Information Structural Models 


By contrast, when the central object of statistical inference is estimation of one or two 
key parameters, a limited-information approach may be used. A feature of this ap- 
proach is that, although one equation is the focus of inference, the joint dependence 
between it and other endogenous variables is exploited. This requires that explicit as- 
sumptions are made about some features of the model that are not the main object of 
inference. Instrumental variable methods, sequential multistep methods, and limited 
information maximum likelihood methods are specific examples of this approach. To 
implement the approach one typically works with one (or more) structural equations 
and some implicitly or explicitly stated reduced form equations. This contrasts with the 
full-information approach where all equations are structural. The limited-information 
approach is often computationally more tractable than the full-information one. 

Statistically we may interpret the limited-information approach as one in which the 
joint distribution is factored into the product of a conditional model for the endogenous 
variable(s) of interest, say $y_1$, and a marginal model for other endogenous variables,
say $y_2$, which are in the set of the conditioning variables, as in


$$f(y|x, \theta) = g(y_1|x, y_2, \theta_1)\, h(y_2|x, \theta_2), \quad \theta \in \Theta. \qquad (2.23)$$

Modeling may be based on the component $g(y_1|x, y_2, \theta_1)$ with minimal attention to
$h(y_2|x, \theta_2)$ if $\theta_2$ are regarded as nuisance parameters. Of course, such a factorization
is not unique, and hence the limited-information approach can have several variants. 


Identified Reduced Forms 


A third variant of the SEM approach works with an identified reduced form. Here too 
one is interested in structural parameters. However, it may be convenient to estimate 
these from the reduced form subject to restrictions. In time series the identified vector 
autoregressions provide an example. 


2.8.2. Identification Strategies 


There are numerous potential ways in which the identification of key model parameters 
can be jeopardized. Omitted variables, functional form misspecifications, measure- 
ment errors in explanatory variables, using data unrepresentative of the population, and 
ignoring endogeneity of explanatory variables are leading examples. Microeconomet- 
rics contains many specific examples of how these challenges can be tackled. Angrist 
and Krueger (2000) provide a comprehensive survey of popular identification strate- 
gies in labor economics, with emphasis on the POM framework. Most of the issues are 
developed elsewhere in the book, but a brief mention is made here. 


Exogenization 


Data are sometimes generated by natural experiments and quasi-experiments. The idea 
here is simply that a policy variable may exogenously change for some subpopulation 
while it remains the same for other subpopulations. For example, minimum wage laws 
in one state may change while they remain unchanged in a neighboring state. Such 
events naturally create treatment and control groups. If the natural experiment ap- 
proximates a randomized treatment assignment, then exploiting such data to estimate 
structural parameters can be simpler than estimation of a larger simultaneous equa- 
tions model with endogenous treatment variables. It is also possible that the treatment 
variable in a natural experiment can be regarded as exogenous, but the treatment itself 
is not randomly assigned. 


Elimination of Nuisance Parameters 


Identification may be threatened by the presence of a large number of nuisance parameters.
For example, in a cross-section regression model the conditional mean function
$E[y_i|x_i]$ may involve an individual specific fixed effect $\alpha_i$, assumed to be correlated
with the regression error. This effect cannot be identified without many observations
on each individual (i.e., panel data). However, with just a short panel it could be eliminated
by a transformation of the model. Another example is the presence of time-invariant
unobserved exogenous variables that may be common to groups of individuals.
An example of a transformation that eliminates fixed effects is taking differences and
working with the differences-in-differences form of the model.
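
The elimination of a fixed effect by differencing can be sketched with a two-period panel. The data-generating process below is an assumption made for illustration only: the regressor is correlated with the individual effect `alpha`, the true slope `BETA` is 1, and the helper `ols_slope` is a hypothetical name for a one-regressor least squares slope. The sketch uses simple first differences rather than the full differences-in-differences form.

```python
import random

random.seed(2)
BETA = 1.0    # assumed true slope
N = 5000      # individuals, each observed in two periods


def ols_slope(x, y):
    """Slope from an OLS regression of y on x with an intercept."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


x_levels, y_levels, x_diff, y_diff = [], [], [], []
for _ in range(N):
    alpha = random.gauss(0.0, 1.0)                           # individual fixed effect
    x = [alpha + random.gauss(0.0, 1.0) for _ in range(2)]   # regressor correlated with alpha
    y = [BETA * xt + alpha + random.gauss(0.0, 1.0) for xt in x]
    x_levels += x
    y_levels += y
    x_diff.append(x[1] - x[0])    # alpha cancels in the difference
    y_diff.append(y[1] - y[0])

pooled = ols_slope(x_levels, y_levels)      # ignores alpha, so it is biased here
first_diff = ols_slope(x_diff, y_diff)      # alpha has been differenced out
print(f"pooled levels:     {pooled:.2f}")
print(f"first differences: {first_diff:.2f}")
```

In this design the levels regression overstates the slope because the omitted effect moves with the regressor, whereas the differenced regression recovers a value near the assumed slope without estimating any of the N individual effects.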


Controlling for Confounders 


When variables are omitted from a regression, and when omitted factors are correlated 
with the included variables, a confounding bias results. For example, in a regression 
with earnings as a dependent variable and schooling as an explanatory variable, indi- 
vidual ability may be regarded as an omitted variable because only imperfect proxies 
for it are typically available. This means that potentially the coefficient of the school- 
ing variable may not be identified. One possible strategy is to introduce control vari- 
ables in the model; the general approach is called the control function approach. 
These variables are an attempt to approximate the influence of the omitted variables. 
For example, various types of scholastic achievement scores may serve as controls for 
ability. 
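
The schooling example can be sketched numerically. The simulation below is illustrative only: the coefficients, the noise levels, and the variable `score` (a noisy test score standing in for ability) are assumptions, and the two-regressor coefficient is obtained by partialling the control out of both earnings and schooling, which gives the same coefficient as a regression on schooling and the control together.

```python
import random

random.seed(3)
N = 20000
RETURN_TO_SCHOOLING = 1.0   # assumed true coefficient


def ols_slope(x, y):
    """Slope from an OLS regression of y on x with an intercept."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(y, x):
    """Residuals from an OLS regression of y on x with an intercept."""
    b = ols_slope(x, y)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return [yi - my - b * (xi - mx) for xi, yi in zip(x, y)]


schooling, score, earnings = [], [], []
for _ in range(N):
    ability = random.gauss(0.0, 1.0)                   # omitted variable
    s = ability + random.gauss(0.0, 1.0)               # schooling rises with ability
    schooling.append(s)
    score.append(ability + random.gauss(0.0, 0.5))     # imperfect proxy for ability
    earnings.append(RETURN_TO_SCHOOLING * s + ability + random.gauss(0.0, 1.0))

no_control = ols_slope(schooling, earnings)
with_control = ols_slope(residuals(schooling, score), residuals(earnings, score))
print(f"schooling coefficient, no control:   {no_control:.2f}")
print(f"schooling coefficient, with control: {with_control:.2f}")
```

Because the proxy measures ability with error, the control reduces the confounding bias in this design but does not remove it entirely.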


Creating Synthetic Samples 


Within the POM framework a causal parameter may be unidentified because no suitable
comparison or control group can provide the benchmark for estimation. A potential
solution is to create a synthetic sample that includes a comparison group whose members are
proxies for controls. Such a sample is created by matching (discussed in Chapter 25).
If treated samples can be augmented by well-matched controls, then identification of 
causal parameters can be achieved in the sense that a parameter related to ATE can be 
estimated. 
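
One simple way to build such a comparison group is nearest-neighbor matching on an observed covariate. The sketch below is not the procedure of Chapter 25; it assumes a single covariate `x`, a treatment probability that rises with `x`, a constant effect of 1 on the treated, and matching with replacement. The function name `nearest_control_outcome` is hypothetical.

```python
import bisect
import random
import statistics

random.seed(3)
EFFECT_ON_TREATED = 1.0   # assumed for illustration

treated, controls = [], []
for _ in range(4000):
    x = random.random()                      # observed covariate on [0, 1)
    y0 = 2.0 * x + random.gauss(0.0, 0.5)    # outcome without treatment
    if random.random() < x:                  # treatment is more likely when x is high
        treated.append((x, y0 + EFFECT_ON_TREATED))
    else:
        controls.append((x, y0))

controls.sort()                              # sort by x so we can search quickly
control_x = [c[0] for c in controls]


def nearest_control_outcome(x):
    """Outcome of the control whose covariate is closest to x."""
    j = bisect.bisect_left(control_x, x)
    candidates = [k for k in (j - 1, j) if 0 <= k < len(control_x)]
    best = min(candidates, key=lambda k: abs(control_x[k] - x))
    return controls[best][1]


naive = (statistics.fmean([y for _, y in treated])
         - statistics.fmean([y for _, y in controls]))
matched = statistics.fmean([y - nearest_control_outcome(x) for x, y in treated])
print(f"unmatched difference in means: {naive:.2f}")
print(f"matched estimate:              {matched:.2f}")
```

The unmatched comparison mixes the treatment effect with the difference in `x` between the groups. Matching each treated unit to a control with a similar `x` removes most of that difference, provided controls exist near every treated value of `x`.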


Instrumental Variables 


If identification is jeopardized because the treatment variable is endogenous, then a 
standard solution is to use valid instrumental variables. This is easier said than done. 
The choice of the instrumental variable as well as the interpretation of the results 
obtained must be done carefully because the results may be sensitive to the choice of 
instruments. The approach is analyzed in Sections 4.8, 4.9, 6.4, 6.5, and 25.7, as well 
as in several other places in the book as the need arises. Again a natural experiment 
may provide a valid instrument. 
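
For a single endogenous regressor and a single instrument, the instrumental variables estimator reduces to a ratio of sample covariances. The simulation below is a sketch under assumed values: `u` is an unobserved confounder, `z` is constructed to be a valid instrument (it shifts `x` but is unrelated to `u`), and the true coefficient `BETA` is 1. With real data, instrument validity cannot be guaranteed by construction in this way.

```python
import random

random.seed(4)
N = 20000
BETA = 1.0   # assumed true coefficient on x


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)


z, x, y = [], [], []
for _ in range(N):
    u = random.gauss(0.0, 1.0)             # unobserved confounder
    zi = random.gauss(0.0, 1.0)            # instrument: affects x, independent of u
    xi = zi + u + random.gauss(0.0, 1.0)   # endogenous regressor
    yi = BETA * xi + u + random.gauss(0.0, 1.0)
    z.append(zi)
    x.append(xi)
    y.append(yi)

ols = cov(x, y) / cov(x, x)   # inconsistent because x is correlated with u
iv = cov(z, y) / cov(z, x)    # simple IV estimator with one instrument
print(f"OLS: {ols:.2f}")
print(f"IV:  {iv:.2f}")
```

The IV estimate is centered near the assumed coefficient, whereas OLS picks up the correlation between `x` and the confounder. The denominator `cov(z, x)` also shows why a weak instrument is a problem: when it is close to zero the ratio becomes unstable.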


Reweighting Samples 


Sample-based inferences about the population are only valid if the sample data are 
representative of the population. The problem of sample selection or biased sampling 
arises when the sample data are not representative, in which case the population param- 
eters are not identified. This problem can be approached as one that requires correction 
for sample selection (Chapter 16) or one that requires reweighting of the sample infor- 
mation (Chapter 24). 

2.9. Bibliographic Notes


The 2001 Nobel lectures by Heckman and McFadden are excellent sources for both his- 
torical and current information about the developments in microeconometrics. Heckman’s 
lecture is remarkable for its comprehensive scope and offers numerous insights into many 
aspects of microeconometrics. His discussion of heterogeneity has many points of contact 
with several topics covered in this book. 

Marschak (1953) gives a classic statement of the primacy of structural modeling for policy 
evaluation. He makes an early mention of the idea of parameter invariance. 

Engle, Hendry, and Richard (1983) provide definitions of weak and strong exogeneity in 
terms of the distribution of observable variables. They make links with previous literature 
on exogeneity concepts. 

and 2.5 The term “identification” was used by Koopmans (1949). Point identification in 
linear parametric models is covered in most textbooks including those by Sargan (1988) 
who gives a comprehensive and succinct treatment, Davidson and MacKinnon (2004), and
Greene (2003). Gouriéroux and Monfort (1989, chapter 3.4) provide a different perspective 
using Fisher and Kullback information measures. Bounds identification in several leading 
cases is developed in Manski (1995). 

Heckman (2000) provides a historical overview and modern interpretations of causality in 
the traditional econometric model. Causality concepts within the POM framework are care- 
fully and incisively analyzed by Holland (1986), who also relates them to other definitions. 
A sample of the statisticians’ viewpoints of causality from a historical perspective can be 
found in Freedman (1999). Pearl (2000) gives an insightful schematic exposition of the idea
of “treating causation as a summary of behavior under interventions,” as well as numerous 
problems of inferring causality in a nonexperimental situation. 

Angrist and Krueger (1999) survey solutions to identification pitfalls with examples from 
labor economics. 

CHAPTER 3


Microeconomic Data Structures 


3.1. Introduction 


This chapter surveys issues concerning the potential usefulness and limitations of dif- 
ferent types of microeconomic data. By far the most common data structure used in 
microeconometrics is survey or census data. These data are usually called observa- 
tional data to distinguish them from experimental data. 

This chapter discusses the potential limitations of the aforementioned data structures.
The inherent limitations of observational data may be further compounded by
the manner in which the data are collected, that is, by the sample frame (the way the 
sample is generated), sample design (simple random sample versus stratified random 
sample), and sample scope (cross-section versus longitudinal data). Hence we also 
discuss sampling issues in connection with the use of observational data. Some of this 
terminology is new at this stage but will be explained later in this chapter. 

Microeconometrics goes beyond the analysis of survey data under the assumptions 
of simple random sampling. This chapter considers extensions. Section 3.2 outlines 
the structure of multistage sample surveys and some common forms of departure from 
random sampling; a more detailed analysis of their statistical implications is provided 
in later chapters. It also considers some commonly occurring complications that result 
in the data not being necessarily representative of the population. Given the deficien- 
cies of observational data in estimating causal parameters, there has been an increased 
attempt at exploiting experimental and quasi-experimental data and frameworks. Sec- 
tion 3.3 examines the potential of data from social experiments. Section 3.4 considers 
the modeling opportunities arising from a special type of observational data, generated 
under quasi-experimental conditions, that naturally provide treated and untreated sub- 
jects and hence are called natural experiments. Section 3.5 covers practical issues of 
microdata management. 

3.2. Observational Data


The major sources of microeconomic observational data are surveys of households and firms
and government administrative data. Census data can also be used to generate samples.
Many other samples are often generated at points of contact between transacting par- 
ties. For example, marketing data may be generated at the point of sale and/or surveys 
among (actual or potential) purchasers. The Internet (e.g., online auctions) is also a 
source of data. 

There is a huge literature on sample surveys from the viewpoint of both survey 
statisticians and users of survey data. The first discusses how to sample from the pop- 
ulation and the results from different sampling designs, and the second deals with the 
issues of estimation and inference that arise when survey data are collected using dif- 
ferent sampling designs. A key issue is how well the sample represents the population. 
This chapter deals with both strands of the literature in an introductory fashion. Many 
additional details are given in Chapter 24. 


3.2.1. Nature of Survey Data 


The term observational data usually refers to survey data collected by sampling the 
relevant population of subjects without any attempt to control the characteristics of 
the sampled data. Let $t$ denote the time subscript, and let $w$ denote a set of variables
of interest. In the present context $t$ can be a point in time or time interval. Let
$S_t$ denote a sample from population probability distribution $F(w_t|\theta_t)$; $S_t$ is a draw
from $F(w_t|\theta_t)$, where $\theta_t$ is a parameter vector. The population should be thought
of as a set of points with characteristics of interest, and for simplicity we assume 
that the form of the probability distribution F is known. A simple random sam- 
pling scheme allows every element of the population to have an equal probability of 
being included in the sample. More complex sampling schemes will be considered 
later. 

The abstract concept of a stationary population provides a useful benchmark. If 
the moments of the characteristics of the population are constant, then we can write 
$\theta_t = \theta$ for all $t$. This is a strong assumption because it implies that the moments of
the characteristics of the population are time-invariant. For example, the age-sex distribution
should be constant. More realistically, some population characteristics would
not be constant. To handle such a possibility, (the parameters of) each population may
be regarded as a draw from a superpopulation with constant characteristics. Specifically,
we think of each $\theta_t$ as a draw from a probability distribution with constant
(hyper)parameter $\theta$. The terms superpopulation and hyperparameters occur frequently
in the literature on hierarchical models discussed in Chapter 24. Additional complications
arise if $\theta_t$ has an evolutionary component, for example through dependence on
t, or if successive values are interdependent. Using hierarchical models, discussed in 
Chapters 13 and 26, provides one approach for modeling the relation between hyper- 
parameters and subpopulation characteristics. 

3.2.2. Simple Random Samples


As a benchmark for subsequent discussion, consider simple random sampling in which 
the probability of sampling unit i from a population of size N, with N large, is 1/N for 
all $i$. Partition $w$ as $[y : x]$. Suppose our interest is in modeling $y$, a possibly vector-valued
outcome variable, conditional on the exogenous covariate vector $x$, whose joint
distribution is denoted $f_J(y, x)$. This can always be factored as the product of the
conditional distribution $f_C(y|x, \theta)$ and the marginal distribution $f_M(x)$:


$$f_J(y, x) = f_C(y|x, \theta)\, f_M(x). \qquad (3.1)$$


Simple random sampling involves drawing the (y, x) combinations uniformly from 
the entire population. 


3.2.3. Multistage Surveys 


One alternative is a stratified multistage cluster sampling, also referred to as a com- 
plex survey method. Large-scale surveys like the Current Population Survey (CPS) 
and the Panel Study of Income Dynamics (PSID) take this approach. Section 24.2
provides additional detail on the structure of the CPS. 

The complex survey design has advantages. It is more cost effective because it 
reduces geographical dispersion, and it becomes possible to sample certain subpop- 
ulations more intensively. For example, “oversampling” of small subpopulations ex- 
hibiting some relevant characteristic becomes feasible whereas a random sample of the 
population would produce too few observations to support reliable results. A disadvan- 
tage is that stratified sampling will reduce interindividual variation, which is essential 
for greater precision. 

The sample survey literature focuses on multistage surveys that sequentially parti- 
tion the population into the following categories: 


1. Strata: Nonoverlapping subpopulations that exhaust the population.
2. Primary sampling units (PSUs): Nonoverlapping subsets of the strata.
3. Secondary sampling units (SSUs): Sub-units of the PSU, which may in turn be partitioned,
and so on.
4. Ultimate sampling unit (USU): The final unit chosen for interview, which could be a
household or a collection of households (a segment).


As an example, the strata may be the various states or provinces in a country, the 
PSU may be regions within the state or province, and the USU may be a small cluster 
of households in the same neighborhood. 

Usually all strata are surveyed so that, for example, all states will be included in 
the sample with certainty. But not all of the PSUs and their subdivisions are surveyed, 
and they may be sampled at different rates. In two-stage sampling the surveyed PSUs 
are drawn at random and the USU is then drawn at random from the selected PSUs. In 
multistage sampling intermediate sampling units such as SSUs also appear. 

A consequence of these sampling methods is that different households will have
different probabilities of being sampled. The sample is then unrepresentative of the 
population. Many surveys provide sampling weights that are intended to be inversely 
proportional to the probability of being sampled, in which case these weights can be 
used to obtain unbiased estimators of population characteristics. 
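
The role of sampling weights can be sketched with a two-stratum population in which the small stratum is heavily oversampled. The stratum sizes, outcome distributions, and sample sizes below are hypothetical values chosen for illustration; the weight attached to each sampled unit is the inverse of its inclusion probability.

```python
import random
import statistics

random.seed(5)

# Hypothetical population: a large stratum A and a small stratum B.
population = {
    "A": [random.gauss(10.0, 2.0) for _ in range(9000)],
    "B": [random.gauss(50.0, 5.0) for _ in range(1000)],
}
SAMPLE_PER_STRATUM = 200   # equal sample sizes, so stratum B is oversampled

values, weights = [], []
for units in population.values():
    prob = SAMPLE_PER_STRATUM / len(units)    # inclusion probability in this stratum
    for v in random.sample(units, SAMPLE_PER_STRATUM):
        values.append(v)
        weights.append(1.0 / prob)            # weight inversely proportional to prob

population_mean = statistics.fmean(population["A"] + population["B"])
unweighted_mean = statistics.fmean(values)
weighted_mean = sum(w * v for w, v in zip(weights, values)) / sum(weights)

print(f"population mean:      {population_mean:.2f}")
print(f"unweighted estimate:  {unweighted_mean:.2f}")
print(f"weighted estimate:    {weighted_mean:.2f}")
```

The unweighted sample mean is pulled toward the oversampled stratum, while the weighted mean is close to the population mean.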

Survey data may be clustered due to, for example, sampling of many households 
in the same small neighborhood. Observations in the same cluster are likely to be de- 
pendent or correlated because they may depend on some observable or unobservable 
factor that could affect all observations in a stratum. For example, a suburb may be 
dominated by high-income households or by households that are relatively homoge- 
neous in some dimension of their preferences. Data from these households will tend 
to be correlated, at least unconditionally, though it is possible that such correlation 
is negligible after conditioning on observable characteristics of the households. Sta- 
tistical inference ignoring correlation between sampled observations yields erroneous 
estimates of variances that are smaller than those from the correct formula. These is- 
sues are covered in greater depth in Section 24.5. Two-stage and multistage samples 
potentially further complicate the computation of standard errors. 
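
The understatement of standard errors under clustering can be seen in a simple case: estimating a population mean from equal-sized clusters that share a common unobserved factor. The sketch below is an illustration under assumed values (100 clusters of 20 households, equal variances for the common and idiosyncratic components). It is not the general correction treated in Section 24.5; with equal cluster sizes, the variation across cluster means gives a valid standard error for the overall mean.

```python
import math
import random
import statistics

random.seed(6)
G, M = 100, 20   # assumed number of clusters and households per cluster

clusters = []
for _ in range(G):
    common = random.gauss(0.0, 1.0)    # factor shared by the whole neighborhood
    clusters.append([common + random.gauss(0.0, 1.0) for _ in range(M)])

pooled = [y for cluster in clusters for y in cluster]

# Formula that treats all G * M observations as independent.
naive_se = statistics.stdev(pooled) / math.sqrt(len(pooled))

# Formula that treats the G cluster means as the independent units.
cluster_means = [statistics.fmean(cluster) for cluster in clusters]
cluster_se = statistics.stdev(cluster_means) / math.sqrt(G)

print(f"naive standard error:         {naive_se:.3f}")
print(f"cluster-based standard error: {cluster_se:.3f}")
```

In this design the naive formula understates the standard error severalfold, because 20 households from one neighborhood carry much less independent information than 20 households drawn independently.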

In summary, (1) stratification with different sampling rates within strata means that 
the sample is unrepresentative of the population; (2) sampling weights inversely pro- 
portional to the probability of being sampled can be used to obtain unbiased estimation 
of population characteristics; and (3) clustering may lead to correlation of observations 
and understatement of the true standard errors of estimators unless appropriate adjust- 
ments are made. 


3.2.4. Biased Samples 


If a random sample is drawn then the probability distribution for the data is the same 
as the population distribution. Certain departures from random sampling cause a di- 
vergence between the two; this is referred to as biased sampling. The data distribution 
differs from the population distribution in a manner that depends on the nature of the 
deviation from random sampling. Deviation from random sampling occurs because it 
is sometimes more convenient or cost effective to obtain the data from a subpopulation
even though it is not representative of the entire population. We now consider several 
examples of such departures, beginning with a case in which there is no departure from 
randomness. 


Exogenous Sampling 


Exogenous sampling from survey data occurs if the analyst segments the available 
sample into subsamples based only on a set of exogenous variables x, but not on the 
response variable. For example, in a study of hospitalizations in Germany, Geil et al. 
(1997) segmented the data into two categories, those with and without chronic condi- 
tions. Classification by income categories is also common. Perhaps it is more accurate 
to depict this type of sampling as exogenous subsampling because it is done by ref- 
erence to an existing sample that has already been collected. Segmenting an existing 
sample by gender, health, or socioeconomic status is very common. Under the assumptions
of exogenous sampling the probability distribution of the exogenous variables
is independent of $y$ and contains no information about the population parameters of
interest, $\theta$. Therefore, one may ignore the marginal distribution of the exogenous variables
and simply base estimation on the conditional distribution $f(y|x, \theta)$. Of course,
the assumption may be wrong and the observed distribution of the outcome variable 
may depend on the selected segmenting variable, which may be correlated with the 
outcome, thus causing departure from exogenous sampling. 


Response-Based Sampling 


Response-based sampling occurs if the probability of an individual being included 
in the sample depends on the responses or choices made by that individual. In this 
case sample selection proceeds according to rules defined in terms of the endogenous
variable under study.

Three examples are as follows: (1) In a study of the effect of negative income tax or 
Aid to Families with Dependent Children (AFDC) on labor supply only those below 
the poverty line are surveyed. (2) In a study of determinants of public transport modal 
choice, only users of public transport (a subpopulation) are surveyed. (3) In a study of 
the determinants of number of visits to a recreational site, only those with at least one 
visit are included. 

Lower survey costs provide an important motivation for using choice-based samples 
in preference to simple random samples. It would require a very large random sample 
to generate enough observations (information) about a relatively infrequent outcome 
or choice, and hence it is cheaper to collect a sample from those who have actually 
made the choice. 

The practical significance of this is that consistent estimation of population parameters
$\theta$ can no longer be carried out using the conditional population density $f(y|x)$
alone. The effect of the sampling scheme must also be taken into account. This topic
is discussed further in Section 24.4. 
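
The recreational-site example gives a simple illustration of why the sampling scheme matters. Suppose, purely as an assumption for this sketch, that the number of visits $y$ is Poisson distributed with mean $\lambda$ in the population, with no covariates. If only individuals with at least one visit are sampled, the mean in the sampled population is

```latex
E[y \mid y \geq 1]
  = \frac{E[y]}{\Pr[y \geq 1]}
  = \frac{\lambda}{1 - e^{-\lambda}}
  > \lambda .
```

For example, with $\lambda = 1$ the truncated mean is $1/(1 - e^{-1}) \approx 1.58$, so a sample average computed from visitors alone would overstate the population mean number of visits unless the truncation is modeled.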


Length-Biased Sampling 


Length-biased sampling illustrates how biases may result from sampling one popu- 
lation to make inferences about a different population. Strictly speaking, it is not so 
much an example of departure from randomness in sampling as one of sampling the 
“wrong” population. 

Econometric studies of transitions model the time spent in origin state j by indi- 
vidual i before transiting to another destination state s. An example is when j cor- 
responds to unemployment and s to employment. The data used in such studies can 
come from one of several possible sources. One source is sampling individuals who 
are unemployed on a particular date, another is to sample those who are in the labor 
force regardless of their current state, and a third is to sample individuals who are ei- 
ther entering or leaving unemployment during a specified period of time. Each type 
of sampling scheme is based on a different concept of the relevant population. In the 
first case the relevant population is the stock of unemployed individuals, in the second
the labor force, and in the third individuals with transitioning employment status. This 
topic is discussed further in Section 18.6. 

Suppose that the purpose of the survey is to calculate a measure of the average 
duration of unemployment. This is the average length of time a randomly chosen indi- 
vidual will spend in unemployment if he or she becomes unemployed. The answer to 
this apparently straightforward question may vary depending on how the sample data 
are obtained. The flow distribution of completed durations is in general quite differ- 
ent from the stock distribution. When we sample the stock, the probability of being in 
the sample is higher for individuals with longer durations. When we sample the flow 
out of the state, the probability does not depend on the time spent in the state. This 
is the well-known example of length-biased sampling in which the estimate obtained 
by sampling the stock is a biased estimate of the average length of an unemployment 
spell of a random entrant to unemployment. 

The following simple schematic diagram may clarify the point: 


[Schematic diagram: Entry flow => stock => Exit flow]


Here we use the symbol ● to denote slow movers and the symbol ○ to denote fast
movers. Suppose the two types are equally represented in the flow, but the slow movers 
stay in the stock longer than the fast movers. Then the stock population has a higher 
proportion of slow movers. Finally, the exit population has a higher proportion of fast 
movers. The argument will generalize to other types of heterogeneity. 

The point of this example is not that flow sampling is a better thing to do than stock 
sampling. Rather, it is that, depending on what the question is, stock sampling may not 
yield a random sample of the relevant population. 
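
The difference between stock and flow sampling can be reproduced in a small simulation. The design is an assumption made for illustration: in every period 100 individuals enter unemployment, each equally likely to be a fast mover (spell of 2 periods) or a slow mover (spell of 10 periods), and the stock is surveyed on a single date.

```python
import random
import statistics

random.seed(7)
SURVEY_DATE = 500
ENTRANTS_PER_PERIOD = 100
DURATIONS = {"fast": 2, "slow": 10}   # assumed completed spell lengths

entry_flow, stock = [], []
for start in range(600):
    for _ in range(ENTRANTS_PER_PERIOD):
        duration = DURATIONS[random.choice(["fast", "slow"])]
        entry_flow.append(duration)                     # every entrant is in the flow
        if start <= SURVEY_DATE < start + duration:     # still unemployed on survey date
            stock.append(duration)

flow_mean = statistics.fmean(entry_flow)
stock_mean = statistics.fmean(stock)
slow_share_in_stock = sum(d == DURATIONS["slow"] for d in stock) / len(stock)

print(f"mean completed duration, entry flow: {flow_mean:.2f}")
print(f"mean completed duration, stock:      {stock_mean:.2f}")
print(f"share of slow movers in the stock:   {slow_share_in_stock:.2f}")
```

Although the two types enter in equal numbers, slow movers make up most of the stock on any given date, so the average completed duration among those sampled from the stock exceeds the average for a random entrant.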


3.2.5. Bias due to Sample Selection 


Consider the following problem. A researcher is interested in measuring the effect of 
training, denoted D (treatment), on posttraining wages, denoted y (outcome), given the
worker’s characteristics, denoted x. The variable D takes the value 1 if the worker has
received training and is 0 otherwise. Observations are available on (x, D) for all work- 
ers but on y only for those who received training (D = 1). One would like to make 
inferences about the average impact of training on the posttraining wage of a ran- 
domly chosen worker with known characteristics who is currently untrained (D = 0). 
The problem of sample selection concerns the difficulty of making such an inference. 

Manski (1995), who views this as a problem of identification, defines the selection 
problem formally as follows: 


This is the problem of identifying conditional probability distributions from random 
sample data in which the realizations of the conditioning variables are always observed 
but realizations of the outcomes are censored. 

Suppose y is the outcome to be predicted, and the conditioning variables are denoted 
by x. The variable D is a censoring indicator that takes the value 1 if the outcome y is 
observed and 0 otherwise. Because the variables (D, x) are always observed, but y is 
observed only when D = 1, Manski views this as a censored sampling process. The 
censored sampling process does not identify Pr[y|x], as can be seen from 


$$\Pr[y \mid x] = \Pr[y \mid x, D = 1]\,\Pr[D = 1 \mid x] + \Pr[y \mid x, D = 0]\,\Pr[D = 0 \mid x]. \tag{3.2}$$


The sampling process can identify three of the four terms on the right-hand side, 
but provides no information about the term Pr[y|x, D = 0]. Because 


$$E[y \mid x] = E[y \mid x, D = 1] \cdot \Pr[D = 1 \mid x] + E[y \mid x, D = 0] \cdot \Pr[D = 0 \mid x],$$


whenever the censoring probability Pr[D = 0|x] is positive, the available empirical 
evidence places no restrictions on E[y|x]. Consequently, the censored-sampling pro- 
cess can identify Pr[y|x] only for some unknown value of Pr[y|x, D = 0]. To learn 
anything about E[y|x], restrictions will need to be placed on Pr[y|x]. 

The alternative approaches for solving this problem are discussed in Section 16.5. 
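
The decomposition of E[y|x] can be made concrete with a small calculation. The sketch below is not from the text: it adds the assumption that y is known to lie in a bounded interval [y_min, y_max], so that the unidentified term E[y|x, D = 0] can be replaced by its smallest and largest logically possible values. The function name and the numbers are hypothetical.

```python
def worst_case_bounds(mean_y_observed, prob_observed, y_min, y_max):
    """Bounds on E[y|x] when E[y|x, D=0] is unknown but y lies in [y_min, y_max].

    mean_y_observed : E[y|x, D=1], identified from the uncensored observations
    prob_observed   : Pr[D=1|x], identified because D is always observed
    """
    prob_censored = 1.0 - prob_observed
    identified_part = mean_y_observed * prob_observed
    lower = identified_part + y_min * prob_censored
    upper = identified_part + y_max * prob_censored
    return lower, upper


# Hypothetical numbers: y is a 0/1 outcome, observed for 70% of the sample.
lower, upper = worst_case_bounds(mean_y_observed=0.6, prob_observed=0.7,
                                 y_min=0.0, y_max=1.0)
print(round(lower, 2), round(upper, 2))
```

The width of the interval equals (y_max - y_min) times Pr[D = 0|x], so it grows without limit as the assumed range of y widens. This is the sense in which, absent further restrictions, the data do not restrict E[y|x].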


3.2.6. Quality of Survey Data 


The quality of sample data depends not only on the sample design and the survey 
instrument but also on the survey responses. This observation applies especially to 
observational data. We consider several ways in which the quality of the sample data 
may be compromised. Some of the problems (e.g., attrition) can also occur with other 
types of data. This topic overlaps with that of biased sampling. 


Problem of Survey Nonresponse 


Surveys are normally voluntary, and the incentive to participate may vary systematically 
according to household characteristics and type of question asked. Individuals may 
refuse to answer some questions. If there is a systematic relationship between refusal 
to answer a question and the characteristics of the individual, then the issue of the 
representativeness of a survey after allowing for nonresponse arises. If nonresponse 
is ignored, and if the analysis is carried out using the data from respondents only, how 
will the estimation of parameters of interest be affected? 

Survey nonresponse is a special case of the selection problem mentioned in the 
preceding section. Both involve biased samples. To illustrate how it leads to distorted 
inference consider the following model: 


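$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \sim N\left[ \begin{bmatrix} x'\beta \\ z'\gamma \end{bmatrix}, \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix} \right],$$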


where $y_1$ is a continuous random variable of interest (e.g., expenditure) that depends 
on x, and $y_2$ is a latent variable that measures the “propensity to participate” in a survey 
and depends on z. The individual participates if $y_2 > 0$; otherwise the individual does 
not. The variables x and z are assumed to be exogenous. The formulation allows $y_1$ 
and $y_2$ to be correlated. 

Suppose we estimate $\beta$ from the data supplied by participants by least squares. 
Is this estimator unbiased in the presence of nonparticipation? The answer is that if 
nonparticipation is random and independent of $y_1$, the variable of interest, then there 
is no bias, but otherwise there will be. 

The argument is as follows: 


$$\hat{\beta} = [X'X]^{-1} X'y_1,$$
$$E[\hat{\beta} - \beta] = E\left[ [X'X]^{-1} X' E[y_1 - X\beta \mid X, Z, y_2 > 0] \right],$$


where the first line gives the least-squares formula for the estimates of $\beta$ and the second 
line gives its bias. If $y_1$ and $y_2$ are independent, conditional on X and Z, so that $\sigma_{12} = 0$, 
then 


$$E[y_1 - X\beta \mid X, Z, y_2 > 0] = E[y_1 - X\beta \mid X, Z] = 0,$$


and there is no bias. 
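
A small simulation may help fix ideas. The sketch below is a hypothetical illustration, not the authors' procedure: it uses a scalar regressor, sets z = x so that the participation propensity varies with the regressor, and fixes β = γ = 1 with unit error variances. All of these values, and the function names, are assumptions. The parameter `rho` plays the role of the correlation between the two errors.

```python
import random


def ols_slope(xs, ys):
    """Slope from a least-squares regression of ys on xs with an intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def respondent_slope(rho, n=20000, beta=1.0, gamma=1.0, seed=0):
    """Least-squares slope using only individuals who participate (y2 > 0)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        e2 = rng.gauss(0.0, 1.0)
        # e1 has unit variance and correlation rho with e2.
        e1 = rho * e2 + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        y1 = beta * x + e1      # outcome of interest
        y2 = gamma * x + e2     # latent propensity to participate (z = x here)
        if y2 > 0:              # y1 is recorded only for participants
            xs.append(x)
            ys.append(y1)
    return ols_slope(xs, ys)


slope_independent = respondent_slope(rho=0.0)
slope_correlated = respondent_slope(rho=0.8)
print(round(slope_independent, 3), round(slope_correlated, 3))
```

With these assumed settings the slope estimated from respondents should be close to the true value of 1 when the errors are uncorrelated, and noticeably below 1 when they are positively correlated.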


Missing and Mismeasured Data 


Survey respondents dealing with an extensive questionnaire will not necessarily an- 
swer every question and even if they do, the answers may be deliberately or fortu- 
itously false. Suppose that the sample survey attempts to obtain a vector of responses 
denoted as $x_i = (x_{i1}, \ldots, x_{iK})$ from N individuals, i = 1,..., N. Suppose now that 
if an individual fails to provide information on any one or more elements of $x_i$, then 
the entire vector is discarded. The first problem resulting from missing data is that the 
sample size is reduced. The second, potentially more serious, problem is that missing 
data can lead to biases similar to selection bias. If the data are missing 
in a systematic manner, then the sample that is left to analyze may not be represen- 
tative of the population. A form of selection bias may be induced by any systematic 
pattern of nonresponse. For example, high-income respondents may systematically not 
respond to questions about income. Conversely, if the data are missing completely at 
random then discarding incomplete observations will reduce precision but not gen- 
erate biases. Chapter 27 discusses the missing-data problem and solutions in greater 
depth. 
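
The contrast between data missing completely at random and data missing systematically can be illustrated numerically. The following sketch uses simulated incomes; the lognormal parameters, the response probabilities, and the function name are all hypothetical choices made for illustration.

```python
import random


def complete_case_mean(incomes, response_prob, seed=0):
    """Mean income among respondents, discarding every nonrespondent."""
    rng = random.Random(seed)
    kept = [y for y in incomes if rng.random() < response_prob(y)]
    return sum(kept) / len(kept)


rng = random.Random(1)
incomes = [rng.lognormvariate(10.0, 0.5) for _ in range(50000)]
full_mean = sum(incomes) / len(incomes)
median_income = sorted(incomes)[len(incomes) // 2]

# Missing completely at random: everyone responds with probability 0.7.
mcar_mean = complete_case_mean(incomes, lambda y: 0.7)

# Systematic nonresponse: incomes at or above the median respond less often.
systematic_mean = complete_case_mean(
    incomes, lambda y: 0.9 if y < median_income else 0.5)

print(round(mcar_mean / full_mean, 3), round(systematic_mean / full_mean, 3))
```

In the first case the respondent mean should stay close to the full-sample mean, with only a loss of precision. In the second case it should fall clearly below it, because high-income individuals are underrepresented among respondents.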

Measurement errors in survey responses are a pervasive problem. They can arise 
from a variety of causes, including incorrect responses arising from carelessness, de- 
liberate misreporting, faulty recall of past events, incorrect interpretation of questions, 
and data-processing errors. A deeper source of measurement error is due to the mea- 
sured variable being at best an imperfect proxy for the relevant theoretical concept. 
The consequences of such measurement errors are a major topic and are discussed in 
Chapter 26. 


Sample Attrition 


In panel data situations the survey involves repeated observations on a set of individu- 
als. In this case we can have 


• full response in all periods (full participation), 
• nonresponse in the first period and in all subsequent periods (nonparticipation), or 
• partial response in the sense of response in the initial periods but nonresponse in later 
  periods (incomplete participation), a situation referred to as sample attrition. 


Sample attrition leads to missing data, and the presence of any nonrandom pattern 
of “missingness” will lead to the sample selection type problems already mentioned. 
This can be interpreted as a special case of the sample selection problem. Sample 
attrition is discussed briefly in Sections 21.8.5 and 23.5.2. 


3.2.7. Types of Observational Data 


Cross-section data are obtained by observing $w_t$ for the sample $S_t$, for some t. Although 
it is usually impractical to sample all households at the same point of time, 
cross-section data are still a snapshot of characteristics of each element of a subset of 
the population that will be used to make inferences about the population. If the population 
is stationary, then inferences made about $\theta_t$ using $S_t$ may be valid also for 
$t' \neq t$. If there is significant dependence between past and current behavior, then 
longitudinal data are required to identify the relationship of interest. For example, past 
decisions may affect current outcomes; inertia or habit persistence may account for 
current purchases, but such dependence cannot be modeled if the history of purchases 
is not available. This is one of the limitations imposed by cross-section data. 

Repeated cross-section data are obtained by a sequence of independent samples 
$S_t$ taken from $F(w_t \mid \theta_t)$, t = 1,..., T. Because the sample design does not attempt to 
retain the same units in the sample, information about dynamic dependence in behavior 
is lost. If the population is stationary then repeated cross-section data are obtained by 
a sampling process somewhat akin to sampling with replacement from the constant 
population. If the population is nonstationary, repeated cross sections are related in a 
manner that depends on how the population is changing over time. In such a case the 
objective is to make inferences about the underlying constant (hyper)parameters. The 
analysis of repeated cross sections is discussed in Section 22.7. 

Panel or longitudinal data are obtained by initially selecting a sample S and 
then collecting observations for a sequence of time periods, t = 1,..., T. This can 
be achieved by interviewing subjects and collecting both present and past data at the 
same time, or by tracking the subjects once they have been inducted into the survey. 
This produces a sequence of data vectors $\{w_1, \ldots, w_T\}$ that are used to make inferences 
about either the behavior of the population or that of the particular sample of 
individuals. The appropriate methodology in each case may not be the same. If the 
data are drawn from a nonstationary population, the appropriate objective should be 
inference on (hyper)parameters of the superpopulation. 

Some limitations of these types of data are immediately obvious. Cross-section 
samples and repeated cross-sections do not in general provide suitable data for mod- 
eling intertemporal dependence in outcomes. Such data are only suitable for modeling 
static relationships. In contrast, longitudinal data, especially if they span a sufficiently 
long time period, are suitable for modeling both static and dynamic relationships. 

Longitudinal data are not free from problems. The first issue is representativeness of 
the panel. Problems of inference regarding population behavior using longitudinal data 
become more difficult if the population is not stationary. For analyzing dynamics of be- 
havior, retaining original households in the panel for as long as possible is an attractive 
option. In practice, longitudinal data sets suffer from the problem of “sample attrition,” 
perhaps due to “sample fatigue.” This simply means that survey respondents do not 
continue to provide responses to questionnaires. This creates two problems: (1) The 
panel becomes unbalanced and (2) there is the danger that the retained households may 
not be “typical” and that the sample becomes unrepresentative of the population. When 
the available sample data are not a random draw from the population, results based on 
different types of data will be susceptible to biases to different degrees. The problem 
of “sample fatigue” arises because over time it becomes more difficult to retain in- 
dividuals within the panel or they may be “lost” (censored) for some other reason, 
such as a change of location. These issues are dealt with later in the book. Analysis 
of longitudinal data may nevertheless provide information about some aspects of the 
behavior of the sampled units, although extrapolation to population behavior may not 
be straightforward. 


3.3. Data from Social Experiments 


Observational and experimental data are distinct because an experimental environment 
can in principle be closely monitored and controlled. This makes it possible to vary 
a causal variable of interest, holding other covariates at controlled settings. In con- 
trast, observational data are generated in an uncontrolled environment, leaving open 
the possibility that the presence of confounding factors will make it more difficult to 
identify the causal relationship of interest. For example, when one attempts to study 
the earnings—schooling relationship using observational data, one must accept that the 
years of schooling of an individual is itself an outcome of an individual’s decision- 
making process, and hence one cannot regard the level of schooling as if it had been 
set by a hypothetical experimenter. 

In social sciences, data analogous to experimental data come from either social 
experiments, defined and described in greater detail in the following, or from “labo- 
ratory” experiments on small groups of voluntary participants that mimic the behavior 
of economic agents in the real-life counterpart of the experiment. Social experiments 
are relatively uncommon, and yet experimental concepts, methods, and data serve as a 
benchmark for evaluating econometric studies based on observational data. 

This section provides a brief account of the methodology of social experiments, the 
nature of the data emanating from them, and some problems and issues of econometric 
methodology that they generate. 

The central feature of the experimental methodology involves a comparison between 
the outcomes of the randomly selected experimental group that is subjected to a 
“treatment” with those of a control (comparison) group. In a good experiment consid- 
erable care is exercised in matching the control and experimental (“treated”) groups, 
and in avoiding potential biases in outcomes. Such conditions may not be realized 
in observational environments, thereby leading to a possible lack of identification of 
causal parameters of interest. Sometimes, however, experimental conditions may be 
approximately replicated in observational data. Consider, for example, two contigu- 
ous regions or states, one of which pursues a different minimum-wage policy from the 
other, creating the conditions of a natural experiment in which observations from the 
“treated” state can be compared with those from the “control” state. The data structure 
of a natural experiment has also attracted attention in econometrics. 

A social experiment involves exogenous variations in the economic environment 
facing the set of experimental subjects, which is partitioned into one subset that re- 
ceives the experimental treatment and another that serves as a control group. In con- 
trast to observational studies in which changes in exogenous and endogenous factors 
are often confounded, a well-designed social experiment aims to isolate the role of 
treatment variables. In some experimental designs there may be no explicit control 
group, but varying levels of the treatment are applied, in which case it becomes pos- 
sible in principle to estimate the entire response surface of experimental outcomes. 

The primary object of a social experiment is to estimate the impact of an actual 
or potential social program. The potential outcome model of Section 2.7 provides a 
relevant background for modeling the impact of social experiments. Several alternative 
measures of impact have been proposed and these will be discussed in the chapter on 
program evaluation (Chapter 25). 

Burtless (1995) summarizes the case for social experiments, while noting some 
potential limitations. In a companion article Heckman and Smith (1995) focus on 
limitations of actual social experiments that have been implemented. The remaining 
discussion in this section borrows significantly from these papers. 


3.3.1. Leading Features of Social Experiments 


Social experiments are motivated by policy issues about how subjects would react to a 
type of policy that has never been tried and hence one for which no observed response 
data exist. The idea of a social experiment is to enlist a group of willing participants, 
some of whom are randomly assigned to a treatment group and the rest to a control 
group. The difference between the responses of those in the treatment group, subjected 
to the policy change, and those in the control group, who are not, is the estimated 
effect of the policy. Schematically the standard experimental design is as depicted in 
Figure 3.1. 

The term “experimentals” refers to the group receiving treatments, “controls” to the 
group not receiving treatment, and “random assignment” to the process of assigning 
individuals to the two groups. 

[Figure 3.1: Social experiment with random assignment. Flowchart: an eligible subject is invited to participate; a subject who agrees to participate is randomized and assigned to treatment or to control; otherwise the subject is dropped from the study.]

Randomized trials were introduced in statistics by R. A. Fisher (1928) and his 
co-workers. A typical agricultural experiment would consist of a trial in which a new 
treatment such as fertilizer application would be applied to plants growing on randomly 
chosen blocks of land and then the responses would be compared with those 
of a control group of plants, similar to the experimentals in all relevant respects but 
not given experimental treatment. If the effect of all other differences between the 
experimental and control groups can be eliminated, the estimated difference between the 
two sets of responses can be attributed to the treatment. In the simplest situation one 
can concentrate on a comparison of the mean outcome of the treated group and of the 
untreated group. 

Although the randomized experiments methodology has long been established in the 
agricultural and biomedical sciences, in economics and the social sciences it is new. 
It is attractive for studying responses to policy changes for which no observational 
data exist, perhaps because the policy changes of interest have never occurred. Ran- 
domized experiments also permit a greater variation in policy variables and parameters 
than are present in observational data, thereby making it easier to identify and study 
responses to policy changes. In many cases the social experiment may try out a pol- 
icy that has never been tried, so the observational data remain completely silent on its 
potential impact. 

Social experiments are still rather rare outside the United States, partly because 
they are expensive to run. In the United States a number of such experiments have 
taken place since the early 1970s. Table 3.1 summarizes features of some relatively 
well-known examples; for a more extensive coverage see Burtless (1995). 

An experiment may produce either cross-section or longitudinal data, although cost 
considerations will usually limit the time dimension well below what is typical in ob- 
servational data. When an experiment lasts several years and has multiple stages and/or 
geographical locations, as in the case of RHIE, interim analyses based on “incomplete” 
data are not uncommon (Newhouse et al., 1993). 


3.3.2. Advantages of Social Experiments 


Table 3.1. Features of Some Selected Social Experiments 

Experiment                    Tested Treatments                   Target Population

Rand Health Insurance         Health insurance plans with         Low- and moderate-level income
Experiment (RHIE),            varying copayment rate and          persons and families
1974-1982                     differing levels of maximum
                              out-of-pocket expenses

Negative Income Tax           NIT plans with alternative          Low- and moderate-level income
(NIT), 1968-1978              income guarantees and tax rates     persons and families with
                                                                  nonaged head of household

Job Training Partnership      Job search assistance,              Out-of-school youths and
Act (JTPA), 1986-1994         on-the-job training, classroom      disadvantaged adults
                              training financed under JTPA

Burtless (1995) surveys the advantages of social experiments with great clarity. 
The key advantage stems from randomized trials that remove any correlation between 
the observed and unobserved characteristics of program participants. Hence the 
contribution of the treatment to the outcome difference between the treated and control 
groups can be estimated without confounding bias even if one cannot control for the 
confounding variables. The presence of correlation between treatment and confounding 
variables often plagues observational studies and complicates causal inference. By 
contrast, an experimental study conducted under ideal circumstances can produce a 
consistent estimate of the average difference in outcomes of the treated and nontreated 
groups without much computational complexity. 

If, however, an outcome depends on treatment as well as other observable fac- 
tors, then controlling for the latter will in general improve the precision of the impact 
estimate. 
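
The role of randomization described above can be illustrated with a simulation. The sketch below is hypothetical: the outcome equation, the true treatment effect of 2, the unobserved confounder labeled `ability`, and the self-selection rule are all assumptions made for illustration. It compares the simple difference in mean outcomes under random assignment with the same comparison when individuals select into treatment on the unobserved confounder.

```python
import random


def difference_in_means(randomized, n=40000, effect=2.0, seed=0):
    """Treated-minus-control mean outcome under two assignment rules."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        ability = rng.gauss(0.0, 1.0)            # unobserved confounder
        if randomized:
            d = rng.random() < 0.5               # coin-flip assignment
        else:
            d = ability + rng.gauss(0.0, 1.0) > 0  # self-selection on ability
        y = effect * d + 1.5 * ability + rng.gauss(0.0, 1.0)
        (treated if d else control).append(y)
    return sum(treated) / len(treated) - sum(control) / len(control)


randomized_estimate = difference_in_means(randomized=True)
self_selected_estimate = difference_in_means(randomized=False)
print(round(randomized_estimate, 2), round(self_selected_estimate, 2))
```

Under random assignment the difference in means should be close to the assumed effect of 2, even though `ability` is never controlled for. Under self-selection the same comparison should overstate the effect, because treated individuals have higher average ability.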

Even if observational data are available, the generation and use of experimental data 
has great appeal because it offers the possibility of exogenizing a policy variable, and 
randomization of treatments can potentially lead to great simplification of statistical 
analysis. Conclusions based on observational data often lack generality because they 
are based on a nonrandom sample from the population — the problem of selection bias. 
An example is the aforementioned RHIE study whose major focus is on the price re- 
sponsiveness of the demand for health services. Availability of health insurance affects 
the user price of health services and thereby its use. An important policy issue is the ex- 
tent to which “overutilization” of health services would result from subsidized health 
insurance. One can, of course, use observational data to model the relation between 
the demand for health services and the level of insurance. However, such analyses are 
subject to the criticism that the level of health insurance should not be treated as ex- 
ogenous. Theoretical analyses show that the demand for health insurance and health 
care are jointly determined, so causation is not unidirectional. This fact can potentially 
make it difficult to identify the role of health insurance. Treating health insurance as 
exogenous biases the estimate of price responsiveness. However, in an experimental 
setup the participating households could be assigned an insurance policy, making it an 
exogenous variable. The role of insurance is then identifiable. Once the key variable 
of interest is exogenized, the direction of causation becomes clear and the impact of 
the treatment can be studied unambiguously. Furthermore, if the experiment is free 
from some of the problems that we mention in the following, this greatly simplifies 
statistical analysis relative to what is often necessary in survey data. 


3.3.3. Limitations of Social Experiments 


The application of a nonhuman methodology, that is, one initially developed for and 
applied to nonhuman subjects, to human subjects has generated a lively debate in the 
literature. See especially Heckman and Smith (1995), who argue that many social ex- 
periments may suffer from limitations that apply to observational studies. These is- 
sues concern general points such as the merits of experimental versus observational 
methodology, as well as specific issues concerning the biases and problems inherent 
in the use of human subjects. Several of the issues are covered in more detail in later 
chapters but a brief overview follows. 

Social experiments are very costly to run. Sometimes, perhaps often, they do not 
correspond to “clean” randomized trials. Hence the results from such experiments are 
not always unambiguous and easily interpretable, or free from biases. If the treatment 
variable has many alternative settings of interest, or if extrapolation is an important 
objective, then a very large sample must be collected to ensure sufficient data variation 
and to precisely gauge the effect of treatment variation. In that case the cost of the 
experiment will also increase. If the cost factor prevents a large enough experiment, its 
utility relative to observational studies may be questionable; see the papers by Rosen 
and Stafford in Hausman and Wise (1985). 

Unfortunately the design of some social experiments is flawed. Hausman and Wise 
(1985) argue that the data from the New Jersey negative income tax experiment was 
subject to endogenous stratification, which they describe as follows: 


... [T]he reason for an experiment is, by randomization, to eliminate correlation 
between the treatment variable and other determinants of the response variable that 
is under study. In each of the income-maintenance experiments, however, the exper- 
imental sample was selected in part on the basis of the dependent variable, and the 
assignment to treatment versus control group was based in part on the dependent 
variable as well. In general, the group eligible for selection — based on family status, 
race, age of family head, etc. — was stratified on the basis of income (and other vari- 
ables) and persons were selected from within the strata. (Hausman and Wise, 1985, 
pp. 190-191) 


The authors conclude that, in the presence of endogenous stratification, unbiased es- 
timation of treatment effects is not straightforward. Unfortunately, a fully randomized 
trial in which treatment assignment within a randomly selected experimental group 
from the population is independent of income would be much more costly and may 
not be feasible. 

There are several other issues that detract from the ideal simplicity of a random- 
ized experiment. First, if experimental sites are selected randomly, cooperation of 
administrators and potential participants at that site would be required. If this is not 
forthcoming, then alternative treatment sites where such cooperation is obtainable 
will be substituted, thereby compromising the random assignment principle; see Hotz 
(1992). 

A second problem is that of sample selection, which is relevant because participa- 
tion is voluntary. For ethical reasons there are many experiments that simply cannot 
be done (e.g., random assignment of students to years of education). Unlike medical 
experiments that can achieve the gold standard of a double-blind protocol, in social 
experiments experimenters and subjects know whether they are in treatment or control 
groups. Furthermore, those in control groups may obtain treatment (e.g., training) 
from alternative sources. If the decision to participate is uncorrelated with either x or 
ε, the analysis of the experimental data is simplified. 

A third problem is sample attrition caused by subjects dropping out of the experi- 
ment after it has started. Even if the initial sample was random the effect of nonran- 
dom attrition may well lead to a problem similar to the attrition bias in panels. Finally, 
there is the problem of the Hawthorne effect. The term originates in social psychology 
research conducted jointly by the Harvard Graduate School of Business Administra- 
tion and the management of the Western Electric Company at the latter’s Hawthorne 
works in Chicago from 1926 to 1932. Human subjects, unlike inanimate objects, may 
change or adapt their behavior while participating in the experiment. In this case the 
variation in the response observed under experimental conditions cannot be attributed 
solely to treatment. 

Heckman and Smith (1995) mention several other difficulties in implementing a 
randomized treatment. Because the administration of a social experiment involves a 
bureaucracy, there is a potential for biases. Randomization bias occurs if the assign- 
ment introduces a systematic difference between the experimental participant and the 
participant during the program's normal operation. Heckman and Smith document the possibilities 
of such bias in actual experiments. Another type of bias, called substitution bias, is 
introduced when the controls may be receiving some form of treatment that substitutes 
for the experimental treatment. Finally, analysis of social experiments is inevitably of 
a partial equilibrium nature. One cannot reliably extrapolate the treatment effects to 
the entire population because the ceteris paribus assumption will not hold when the 
entire population is involved. 

Specifically, the key issue is whether one can extrapolate the results from the exper- 
iment to the population at large. If the experiment is conducted as a pilot program on a 
small scale, but the intention is to predict the impact of policies that are more broadly 
applied, then the obvious limitation is that the pilot program cannot incorporate the 
broader impact of the treatment. A broadly applied treatment may change the eco- 
nomic environment sufficiently to invalidate the predictions from a partial equilibrium 
setup. So the treatment will not be like the actual policy that it mimics. 

In summary, social experiments, in principle, could yield data that are easier to analyze 
and to understand in terms of cause and effect than observational data. Whether 
this promise is realized depends on the experimental design. A poor experimental 
design generates its own statistical complications, which affect the precision of 
the conclusions. Social experiments differ fundamentally from those in biology and 
agriculture because human subjects and treatment administrators tend to be both 
active and forward-looking individuals with personal preferences, rather than 
passive administrators of a standard protocol or willing recipients of randomly assigned 
treatment. 

Table 3.2. Features of Some Selected Natural Experiments 

Experiment                            Treatments Studied                  Reference

Outcomes for identical twins          Differences in returns to           Ashenfelter and
with different schooling levels       schooling through correlation       Krueger (1994)
                                      between schooling and wages

Transition to National Health         Labor market effects of NHI         Gruber and
Insurance in Canada as Saskatchewan   based on comparison of              Hanratty (1995)
moves to NHI and other provinces      provinces with and without NHI
follow several years later

New Jersey increases minimum          Minimum wage effects on             Card and
wage while neighboring                employment                          Krueger (1994)
Pennsylvania does not


3.4. Data from Natural Experiments 


Sometimes, however, a researcher may have available data from a “natural experi- 
ment.” A natural experiment occurs when a subset of the population is subjected to 
an exogenous variation in a variable, perhaps as a result of a policy shift, that would 
ordinarily be subject to endogenous variation. Ideally, the source of the variation is 
well understood. 

In microeconometrics there are broadly two ways in which the idea of a natural 
experiment is exploited. For concreteness consider the simple regression model 


$$y = \beta_1 + \beta_2 x + u, \tag{3.4}$$


where x is an endogenous treatment variable correlated with u. 

Suppose that there is an exogenous intervention that changes x. Examples of such 
external intervention are administrative rules, unanticipated legislation, natural events 
such as twin births, weather-related shocks, and geographical variation; see Table 3.2 
for examples. Exogenous intervention creates an opportunity for evaluating its im- 
pact by comparing the behavior of the impacted group both pre- and postintervention, 
or with that of a nonimpacted group postintervention. That is, “natural” comparison 
groups are generated by the event that facilitates estimation of $\beta_2$. Estimation is 
simplified because x can be treated as exogenous. 

The second way in which a natural experiment can assist inference is by generating 
natural instrumental variables. Suppose z is a variable that is correlated with x, or 
perhaps causally related to x, and uncorrelated with u. Then an instrumental variable 
estimator of $\beta_2$, expressed in terms of sample covariances, is 


$$\hat{\beta}_2 = \frac{\mathrm{Cov}[z, y]}{\mathrm{Cov}[z, x]} \tag{3.5}$$


(see Section 4.8.5). In an observational data setup an instrumental variable with the 
right properties may be difficult to find, but it could arise naturally in a favorable 
natural experiment. Then estimation would be simplified. We consider the first case 
in the next section; the topic of naturally generated instruments will be covered in 
Chapter 25. 
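
The ratio-of-covariances estimator in (3.5) is simple to compute. The following sketch applies it to simulated data generated from model (3.4); the data-generating process, the parameter values (β1 = 1, β2 = 2), and the function names are hypothetical and chosen only so that x is correlated with u while z is not.

```python
import random


def sample_cov(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)


def simulate(n=50000, beta1=1.0, beta2=2.0, seed=0):
    """Draw (z, x, y) with x endogenous: x depends on the error u."""
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n):
        zi = rng.gauss(0.0, 1.0)                     # instrument, independent of u
        ui = rng.gauss(0.0, 1.0)                     # regression error
        xi = zi + 0.8 * ui + rng.gauss(0.0, 1.0)     # x is correlated with u
        z.append(zi)
        x.append(xi)
        y.append(beta1 + beta2 * xi + ui)
    return z, x, y


z, x, y = simulate()
ols_estimate = sample_cov(x, y) / sample_cov(x, x)   # treats x as exogenous
iv_estimate = sample_cov(z, y) / sample_cov(z, x)    # estimator in (3.5)
print(round(ols_estimate, 3), round(iv_estimate, 3))
```

In this assumed design the least-squares slope should overstate β2 because x and u are positively correlated, whereas the instrumental variable estimate should be close to the assumed value of 2.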


3.4.1. Natural Exogenous Interventions 


Such data are less expensive to collect and they also allow the researcher to evaluate the 
role of some specific factor in isolation, as in a controlled experiment, because “nature” 
holds constant variations attributed to other factors that are not of direct interest. Such 
natural experiments are attractive because they generate treatment and control groups 
inexpensively and in a real-world setting. Whether a natural experiment can support 
convincing inference depends, in part, on whether the supposed natural intervention 
is genuinely exogenous, whether its impact is sufficiently large to be measurable, and 
whether there are good treatment and control groups. Just because a change is legis- 
lated, for example, does not mean that it is an exogenous intervention. However, in 
appropriate cases, opportunistic exploitation of such data sets can yield valuable em- 
pirical insights. 

Investigations based on natural experiments have several potential limitations 
whose importance in any given study can only be assessed through a careful con- 
sideration of the relevant theory, facts, and institutional setting. Following Campbell 
(1969) and Meyer (1995), these are grouped into limitations that affect a study’s inter- 
nal validity (i.e., the inferences about policy impact drawn from the study) and those 
that affect a study’s external validity (i.e., the generalization of the conclusions to other 
members of the population). 

Consider an investigation of a policy change in which conclusions are drawn from 
a comparison of pre- and postintervention data, using the regression method briefly 
described in the following and in greater detail in Chapter 25. In any study there will 
be omitted variables that may have also changed in the time interval between policy 
change and its impact. The characteristics of sampled individuals such as age, health 
status, and their actual or anticipated economic environment may also change. These 
omitted factors will directly affect the measured impact of the policy change. Whether 
the results can be generalized to other members of the population will depend on the 
absence of bias due to nonrandom sampling, existence of significant interaction effects 
between the policy change and its setting, and an absence of the role of historical 
factors that would cause the impact to vary from one situation to another. Of course, 
these considerations are not unique to data from natural experiments; rather, the point 
is that the latter are not necessarily free from these problems. 


3.4.2. Differences in Differences 


One simple regression method is based on a comparison of outcomes in one group 
before and after a policy intervention. For example, consider 


$$y_{it} = \alpha + \beta D_t + \varepsilon_{it}, \quad i = 1, \ldots, N, \quad t = 0, 1,$$

where $D_t = 1$ in period 1 (postintervention), $D_t = 0$ in period 0 (preintervention), and
$y_{it}$ measures the outcome. The regression estimated from the pooled data will yield an
estimate of the policy impact parameter $\beta$. This is easily shown to be equal to the average
difference in the pre- and postintervention outcome,

$$\hat{\beta} = N^{-1} \sum_{i} (y_{i1} - y_{i0}) = \bar{y}_1 - \bar{y}_0.$$


The one-group before and after design makes the strong assumption that the group 
remains comparable over time. This is required for identifiability of $\beta$. If, for example,
we allowed $\alpha$ to vary between the two periods, $\beta$ would no longer be identified.
Changes in $\alpha$ are confounded with the policy impact.
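
The two claims about the one-group design can be checked numerically. The sketch below is illustrative only: the simulated outcomes, the parameter values, and the names `ols_on_dummy` and `before_after` are assumptions, not part of the original analysis. It regresses pooled outcomes on a constant and the period dummy, compares the slope with the difference in period means, and then adds a hypothetical shift in the intercept between periods.

```python
import random


def ols_on_dummy(y, d):
    """OLS intercept and slope from regressing y on a constant and a 0/1 dummy d."""
    n = len(y)
    mean_y = sum(y) / n
    mean_d = sum(d) / n
    sxy = sum((di - mean_d) * (yi - mean_y) for di, yi in zip(d, y))
    sxx = sum((di - mean_d) ** 2 for di in d)
    slope = sxy / sxx
    return mean_y - slope * mean_d, slope


def before_after(n=5000, alpha=1.0, beta=0.5, period_shift=0.0, seed=1):
    """Return (pooled OLS slope, difference in period means) for simulated data.

    period_shift is a hypothetical change in the intercept between periods
    that is unrelated to the policy.
    """
    rng = random.Random(seed)
    y0 = [alpha + rng.gauss(0, 1) for _ in range(n)]
    y1 = [alpha + period_shift + beta + rng.gauss(0, 1) for _ in range(n)]
    _, slope = ols_on_dummy(y0 + y1, [0] * n + [1] * n)
    return slope, sum(y1) / n - sum(y0) / n


if __name__ == "__main__":
    print("no shift:  ", before_after(period_shift=0.0))
    print("with shift:", before_after(period_shift=0.3))
```

With the same seed, the second estimate exceeds the first by the assumed shift, so the before and after comparison alone cannot separate the policy impact from the change in the intercept.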

One way to improve on the previous design is to include an additional untreated 
comparison group, that is, one not impacted by policy, and for which the data are avail- 
able in both periods. Using Meyer’s (1995) notation, the relevant regression now is 


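$$y_{it}^{j} = \alpha + \alpha_1 D_t + \alpha^1 D^j + \beta D_t^j + \varepsilon_{it}^{j}, \quad i = 1, \ldots, N, \quad t = 0, 1,$$

where j is the group superscript, $D^j = 1$ if j equals 1 and $D^j = 0$ otherwise, $D_t^j = 1$
if both j and t equal 1 and $D_t^j = 0$ otherwise, and $\varepsilon$ is a zero-mean constant-variance
error term. The equation does not include covariates but they can be added, and those
that do not vary are already subsumed under $\alpha$. This relation implies that, for the
treated group, we have preintervention

$$y_{i0}^{1} = \alpha + \alpha^1 D^1 + \varepsilon_{i0}^{1}$$

and postintervention

$$y_{i1}^{1} = \alpha + \alpha_1 + \alpha^1 D^1 + \beta + \varepsilon_{i1}^{1}.$$

The impact is therefore

$$y_{i1}^{1} - y_{i0}^{1} = \alpha_1 + \beta + \varepsilon_{i1}^{1} - \varepsilon_{i0}^{1}. \qquad (3.6)$$

The corresponding equations for the untreated group are

$$y_{i0}^{0} = \alpha + \varepsilon_{i0}^{0},$$

$$y_{i1}^{0} = \alpha + \alpha_1 + \varepsilon_{i1}^{0},$$

and hence the difference is

$$y_{i1}^{0} - y_{i0}^{0} = \alpha_1 + \varepsilon_{i1}^{0} - \varepsilon_{i0}^{0}. \qquad (3.7)$$

Both the first-difference equations include the period-1 specific effect $\alpha_1$, which can
be eliminated by taking the difference between Equations (3.6) and (3.7):

$$(y_{i1}^{1} - y_{i0}^{1}) - (y_{i1}^{0} - y_{i0}^{0}) = \beta + (\varepsilon_{i1}^{1} - \varepsilon_{i0}^{1}) - (\varepsilon_{i1}^{0} - \varepsilon_{i0}^{0}). \qquad (3.8)$$

Assuming that $E[(\varepsilon_{i1}^{1} - \varepsilon_{i0}^{1}) - (\varepsilon_{i1}^{0} - \varepsilon_{i0}^{0})]$ equals zero, we can obtain an unbiased
estimate of $\beta$ by the sample average of $(y_{i1}^{1} - y_{i0}^{1}) - (y_{i1}^{0} - y_{i0}^{0})$. This method uses
differences in differences. If time-varying covariates are present, they can be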
included in the relevant equations and their differences will appear in the regression 
equation (3.8). 
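
A minimal simulation sketch of the differences-in-differences calculation follows. Everything specific in it is an assumption made for illustration: the parameter values, the standard normal errors, and the names `simulate_group` and `diff_in_diff`. The estimator is computed from the four group-by-period sample means, and the treated group's own before and after difference is shown for contrast.

```python
import random
from statistics import mean


def simulate_group(n, in_treated_group, rng,
                   alpha=1.0, alpha_1=0.3, alpha_g=0.7, beta=0.5):
    """Return (pre, post) outcomes for one group.

    alpha_1 is the common period-1 effect, alpha_g the treated-group effect,
    and beta the policy impact; all values are hypothetical.
    """
    d_group = 1 if in_treated_group else 0
    pre = [alpha + alpha_g * d_group + rng.gauss(0, 1) for _ in range(n)]
    post = [alpha + alpha_1 + alpha_g * d_group + beta * d_group + rng.gauss(0, 1)
            for _ in range(n)]
    return pre, post


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group's mean minus change in the control group's mean."""
    return ((mean(treated_post) - mean(treated_pre))
            - (mean(control_post) - mean(control_pre)))


if __name__ == "__main__":
    rng = random.Random(3)
    t_pre, t_post = simulate_group(20000, True, rng)
    c_pre, c_post = simulate_group(20000, False, rng)
    print("treated before-after:", mean(t_post) - mean(t_pre))
    print("diff-in-diff:        ", diff_in_diff(t_pre, t_post, c_pre, c_post))
```

In this design the treated group's before and after change mixes the period effect with the policy impact, while subtracting the control group's change removes the common period effect.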

For simplicity our analysis ignored the possibility that there remain observable dif- 
ferences in the distribution of characteristics between the treatment and control groups. 
If so, then such differences must be controlled for. The standard solution is to include 
such controlling variables in the regression. 

An example of a study based on a natural experiment is that of Ashenfelter and 
Krueger (1994). They estimate the returns to schooling by contrasting the wage rates 
of identical twins with different schooling levels. In this case running a regular exper- 
iment in which individuals are exogenously assigned different levels of schooling is 
simply not feasible. Nonetheless, some experimental-type controls are needed. As the 
authors explain: 


Our goal is to ensure that the correlation we observe between schooling and wage 
rates is not due to a correlation between schooling and a worker’s ability or other 
characteristics. We do this by taking advantage of the fact that monozygotic twins 
are genetically identical and have similar family backgrounds. 


Data on twins have served as a basis for a number of other econometric studies 
(Rosenzweig and Wolpin, 1980; Bronars and Grogger, 1994). Since the twinning prob- 
ability in the population is not high, an important issue is generating a sufficiently 
large representative sample, allowing for some nonresponse. One source of such data 
is the census. Another source is the “twins festivals” that are held in the United States. 
Ashenfelter and Krueger (1994, p. 1158) report that their data were obtained from in- 
terviews conducted at the 16th Annual Twins Day Festival, Twinsburg, Ohio, August 
1991, which is the largest gathering of twins, triplets, and quadruplets in the world. 

The attraction of using the twins data is that the presence of common effects from 
both observable and unobservable factors can be eliminated by modeling the differ- 
ences between the outcomes of the twins. For example, Ashenfelter and Krueger esti- 
mate a regression model of the difference in the log of wage rates between the first and 
the second twin. The first differencing operation eliminates the effects of age, gender, 
ethnicity, and so forth. The remaining explanatory variables are differences between 
schooling levels, which is the variable of main interest, and variables such as differ- 
ences in years of tenure and marital status. 
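
The logic of within-pair differencing can be sketched with simulated twin pairs. This is not the Ashenfelter and Krueger data or specification: the wage equation, the shared unobserved family effect, the parameter values, and the names `slope`, `simulate_twins`, and `pooled_and_within` are all hypothetical. The shared effect raises both schooling and wages, so a pooled regression overstates the return to schooling, while the regression in within-pair differences removes the shared effect.

```python
import random


def slope(x, y):
    """OLS slope from regressing y on a constant and x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def simulate_twins(n_pairs=10000, returns=0.08, ability_effect=0.2, seed=2):
    """Each pair shares an unobserved effect a that raises schooling and log wages."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        a = rng.gauss(0, 1)  # common to both twins, unobserved by the analyst
        twins = []
        for _ in range(2):
            s = 12 + a + rng.gauss(0, 1)  # years of schooling
            w = returns * s + ability_effect * a + rng.gauss(0, 0.1)  # log wage
            twins.append((s, w))
        pairs.append(twins)
    return pairs


def pooled_and_within(pairs):
    """Return (pooled OLS slope, slope from within-pair differences)."""
    s_all = [s for pair in pairs for s, _ in pair]
    w_all = [w for pair in pairs for _, w in pair]
    ds = [pair[0][0] - pair[1][0] for pair in pairs]
    dw = [pair[0][1] - pair[1][1] for pair in pairs]
    return slope(s_all, w_all), slope(ds, dw)


if __name__ == "__main__":
    pooled, within = pooled_and_within(simulate_twins())
    print("pooled slope:     ", pooled)
    print("within-pair slope:", within)
```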


3.4.3. Identification through Natural Experiments 


The natural experiments school has had a useful impact on econometric practice. By 
encouraging the opportunistic exploitation of quasi-experimental data, and by using 
modeling frameworks such as the POM of Chapter 2, econometric practice bridges the 
gap between observational and experimental data. The notions of parameter identifica- 
tion rooted in the SEM framework are broadened to include identification of measures 
that are interesting from a policy viewpoint. The main advantage of using data from a 
natural experiment is that a policy variable of interest might be validly treated as ex- 
ogenous. However, in using data from natural experiments, as in the case of social 
experiments, the choice of control groups plays a critical role in determining the
reliability of the conclusions. Several potential problems that affect a social experi- 
ment, such as selectivity and attrition bias, will also remain potential problems in the 
case of natural experiments. Only a subset of interesting policy problems may lend 
themselves to analysis within the natural experiment framework. The experiment may 
apply only to a small part of the population, and the conditions under which it occurs 
may not replicate themselves easily. An example given in Section 22.6 illustrates this 
point in the context of difference in differences. 


3.5. Practical Considerations 


Although there has been an explosion in the number and type of microdata sets that 
are available, certain well-established databases have supported numerous studies. We 
provide a very partial list of some very well known U.S. micro databases. For further
details, see the respective Web sites for these data sets or the data clearinghouses
mentioned in the following. Many of these allow you to download the data directly. 


3.5.1. Some Sources of Microdata 


Panel Study of Income Dynamics (PSID): Based at the Survey Research Center at
the University of Michigan, PSID is a national survey that has been running since 
1968. Today it covers over 40,000 individuals and collects economic and demo- 
graphic data. These data have been used to support a wide variety of microecono- 
metric analyses. Brown, Duncan and Stafford (1996) summarize recent develop- 
ments in PSID data. 


Current Population Survey (CPS): This is a monthly national survey of about 50,000 
households that provides information on labor force characteristics. The survey has 
been conducted for more than 50 years. Major revisions in the sample have fol- 
lowed each of the decennial censuses. For additional details about this survey see 
Section 24.2. It is the basis of many federal government statistics on earnings and 
unemployment. It is also an important source of microdata that have supported nu- 
merous studies especially of labor markets. The survey was redesigned in 1994 
(Polivka, 1996). 


National Longitudinal Survey (NLS): The NLS has four original cohorts: NLS Older 
Men, NLS Young Men, NLS Mature Women, and NLS Young Women. Each of 
the original cohorts is a national yearly survey of over 5,000 individuals who have 
been repeatedly interviewed since the mid-1960s. Surveys collect information on 
each respondent’s work experiences, education, training, family income, household 
composition, marital status, and health. Supplementary data on age, sex, etc. are 
available. 


National Longitudinal Surveys of Youth (NLSY): The NLSY is a national annual 
survey of 12,686 young men and young women who were 14 to 22 years of age
when they were first surveyed in 1979. It contains three subsamples. The data 
provide a unique opportunity to study the life-course experiences of a large sam-
ple of young adults who are representative of American men and women born in 
the late 1950s and early 1960s. A second NLSY began in 1997. 


Survey of Income and Program Participation (SIPP): SIPP is a longitudinal survey 
of around 8,000 housing units per month. It covers income sources, participation in 
entitlement programs, correlation between these items, and individual attachments 
to the job market over time. It is a multipanel survey with a new panel being intro- 
duced at the beginning of each calendar year. The first panel of SIPP was initiated 
in October 1983. Compared with CPS, SIPP has fewer employed and more unem- 
ployed persons. 


Health and Retirement Study (HRS): The HRS is a longitudinal national study. 
The baseline consists of interviews with members of 7,600 households in 1992 
(respondents aged from 51 to 61) with follow-ups every two years for 12 years. The 
data contain a wealth of economic, demographic, and health information. 


World Bank’s Living Standards Measurement Study (LSMS): The World Bank’s 
LSMS household surveys collect data “on many dimensions of household well- 
being that can be used to assess household welfare, understand household behavior, 
and evaluate the effects of various government policies on the living conditions of 
the population” in many developing countries. Many examples of the use of these 
data can be found in Deaton (1997) and in the economic development literature. 
Grosh and Glewwe (1998) outline the nature of the data and provide references to 
research studies that have used them. 


Data clearinghouses: The Interuniversity Consortium for Political and Social Re- 
search (ICPSR) provides access to many data sets, including the PSID, CPS, NLS, 
SIPP, National Medical Expenditure Survey (NMES), and many others. The U.S. 
Bureau of Labor Statistics handles the CPS and NLS surveys. The U.S. Bureau of 
Census handles the SIPP. The U.S. National Center for Health Statistics provides 
access to many health data sets. A useful gateway to European data archives is 
the Council of European Social Science Data Archives (CESSDA), which provides 
links to several European national data archives. 


Journal data archives: For some purposes, such as replication of published results 
for classroom work, you can get the data from journal archives. Two archives in 
particular have well-established procedures for data uploads and downloads using 
an Internet browser. The Journal of Business and Economic Statistics archives data 
used in most but not all articles published in that journal. The Journal of Applied 
Econometrics data archive is also organized along similar lines and contains data 
pertaining to most articles published since 1994. 


3.5.2. Handling Microdata 


Microeconomic data sets tend to be quite large. Samples of several hundreds or thou- 
sands are common and even those of tens of thousands are not unusual. The distribu- 
tions of outcomes of interest are often nonnormal, in part because one is often dealing 
with discrete data such as binary outcomes, or with data that have limited variation
such as proportions or shares, or with truncated or censored continuous outcomes. 
Handling large nonnormal data sets poses some problems of summarizing and report- 
ing the important features of data. Often it is useful to use one computing environment 
(program) for data extraction, reduction, and preparation and a different one for model 
estimation. 


3.5.3. Data Preparation 


The most basic feature of microeconometric analysis is that the process of arriving at 
the sample finally used in the econometric investigation is likely to be a long one. It 
is important to accurately document decisions and choices made by the investigator in 
the process of “cleaning up” the data. Let us consider some specific examples. 

One of the most common features of sample survey data is nonresponse or partial
response. The problems of nonresponse have already been discussed. Partial response
usually means that some parts of survey questionnaires were not answered. If
this means that some of the required information is not available, the observations in 
question are deleted. This is called listwise deletion. If this problem occurs in a sig- 
nificant number of cases, it should be properly analyzed and reported because it could 
lead to an unrepresentative sample and biases in estimation. The issue is analyzed in 
Chapter 27. For example, consider a question in a household survey to which high- 
income households do not respond, leading to a sample in which these households are 
underrepresented. Hence the end effect is no different from one in which there is a full 
response but the sample is not representative. 

A second problem is measurement error in reported data. Microeconomic data are 
typically noisy. The extent, type, and seriousness of measurement error depend on the
type of survey (cross section or panel), the individual who responds to the survey, and
the variable about which information is sought. For example, self-reported income data 
from panel surveys are strongly suspected to have serially correlated measurement er- 
ror. In contrast, reported expenditure magnitudes are usually thought to have a smaller 
measurement error. Deaton (1997) surveys some of the sources of measurement error
with special reference to the World Bank’s Living Standards Measurement Study,
although several of the issues raised have wider relevance. The biases from measure- 
ment error depend on what is done to the data in terms of transformations (e.g., first 
differencing) and the estimator used. Hence to make informative statements about the 
seriousness of biases from measurement error, one must analyze well-defined mod- 
els. Later chapters will give examples of the impact of measurement error in specific 
contexts. 


3.5.4. Checking Data 


In large data sets it is easy to have erroneous data resulting from keyboard and cod- 
ing errors. One should therefore apply some elementary checks that would reveal the 
existence of problems. One can check the data before analyzing them by examining some
descriptive statistics. The following techniques are useful. First, use summary statistics
(min, max, mean, and median) to make sure that the data are in the proper interval and
on the proper scale. For instance, indicator variables should take only the values zero and
one, and counts should be greater than or equal to zero. Sometimes missing data are coded
as -999, or some other integer, so take care not to treat these entries as data. Second,
one should know whether changes are fractional or on a percentage scale. Third, use 
box and whisker plots to identify problematic observations. For instance, using box 
and whisker plots one researcher found a country that had negative population growth 
(owing to a war) and another country that had recorded investment as more than GDP 
(because foreign aid had been excluded from the GDP calculation). Checking observa- 
tions before proceeding with estimation may also suggest normalizing transformations 
and/or distributional assumptions with features appropriate for modeling a particular 
data set. Fourth, screening data may suggest appropriate data transforms. For example,
box and whisker plots and histograms could suggest which variables might be better 
modeled via a log or power transform. Finally, it may be important to check the scales 
of measurement. For some purposes, such as the use of nonlinear estimators, it may 
be desirable to scale variables so that they have roughly similar scale. Summary statis- 
tics can be used to check that the means, variances, and covariances of the variables 
indicate proper scaling. 
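
Several of these checks are easy to automate. The sketch below uses only the Python standard library (`statistics.quantiles` requires Python 3.8 or later). The sentinel value -999, the function names, and the 1.5 interquartile-range whisker rule are assumptions for illustration; the missing-value codes actually used should be taken from the codebook of the survey at hand.

```python
from statistics import mean, median, quantiles

# Assumed sentinel codes for missing data; check the survey codebook.
MISSING_CODES = frozenset({-999})


def drop_missing(values, missing_codes=MISSING_CODES):
    """Remove entries that are missing-data codes rather than data."""
    return [v for v in values if v not in missing_codes]


def summarize(values, missing_codes=MISSING_CODES):
    """Min, max, mean, and median of the non-missing entries."""
    valid = drop_missing(values, missing_codes)
    return {
        "n": len(values),
        "n_missing": len(values) - len(valid),
        "min": min(valid),
        "max": max(valid),
        "mean": mean(valid),
        "median": median(valid),
    }


def out_of_range(values, lower, upper, missing_codes=MISSING_CODES):
    """Positions of non-missing entries outside the interval [lower, upper]."""
    return [i for i, v in enumerate(values)
            if v not in missing_codes and not (lower <= v <= upper)]


def box_plot_outliers(values, missing_codes=MISSING_CODES, whisker=1.5):
    """Positions of entries beyond the usual box-and-whisker fences."""
    valid = drop_missing(values, missing_codes)
    q1, _, q3 = quantiles(valid, n=4)
    low = q1 - whisker * (q3 - q1)
    high = q3 + whisker * (q3 - q1)
    return [i for i, v in enumerate(values)
            if v not in missing_codes and (v < low or v > high)]


if __name__ == "__main__":
    counts = [0, 2, 1, -999, 3, -1]  # a count variable with one bad entry
    print(summarize(counts))
    print("negative counts at positions:", out_of_range(counts, 0, float("inf")))
    print("outliers at positions:", box_plot_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```

Flagged positions are candidates for inspection, not automatic deletion: as the examples in the text show, an extreme value may be a genuine observation.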


3.5.5. Presenting Descriptive Statistics 


Because microdata sets are usually large, it is essential to provide the reader with an 
initial table of descriptive statistics, usually mean, standard deviation, minimum, and 
maximum for every variable. In some cases unexpectedly large or small values may 
reveal the presence of a gross recording error or erroneous inclusion of an incorrect 
data point. Two-way scatter diagrams are usually not helpful, but tabulation of cate- 
gorical variables (contingency tables) can be. For discrete variables histograms can be 
useful and for continuous variables density plots can be informative. 


3.6. Bibliographic Notes 


3.2 Deaton (1997) provides an introduction to sample surveys especially for developing 
economies. Several specific references to complex surveys are provided in Chapter 24. 
Becketti et al. (1988) investigate the importance of the issue of representativeness of the 
PSID. 

3.3 The collective volume edited by Hausman and Wise (1985) contains several papers on indi- 
vidual social experiments including the RHIE, NIT, and Time-of-Use pricing experiments. 
Several studies question the usefulness of the experimental data and there is extensive dis- 
cussion of the flaws in experimental designs that preclude clear conclusions. Pros and cons 
of social experiments versus observational data are discussed in an excellent pair of papers 
by Burtless (1995) and Heckman and Smith (1995). 

3.4 A special issue of the Journal of Business and Economic Statistics (1995) carries a number 
of articles that use the methodology of quasi- or natural experiments. The collection in- 
cludes an article by Meyer who surveys the issues in and the methodology of econometric 
studies that use data from natural experiments. He also provides a valuable set of guidelines
on the credible use of natural variation in making inferences about the impact of economic 
policies, partly based on the work of Campbell (1969). Kim and Singal (1993) study the 
impact of changes in market concentration on price using the data generated by airline
mergers. Rosenzweig and Wolpin (2000) review an extensive literature based on natural 
experiments such as identical twins. Isacsson (1999) uses the twins approach to study re- 
turns to schooling using Swedish data. Angrist and Lavy (1999) study the impact of class 
size on test scores using data from schools that are subject to “Maimonides’ Rule” (briefly
reviewed in Section 25.6), which states that class size should not exceed 40. The rule gen- 
erates an instrument. 
Core Methods


Part 2 presents the core estimation methods — least squares, maximum likelihood and 
method of moments — and associated methods of inference for nonlinear regression 
models that are central in microeconometrics. The material also includes modern top- 
ics such as quantile regression, sequential estimation, empirical likelihood, semipara- 
metric and nonparametric regression, and statistical inference based on the bootstrap. 
In general the discussion is at a level intended to provide enough background and 
detail to enable the practitioner to read and comprehend articles in the leading econo- 
metrics journals and, where needed, subsequent chapters of this book. We presume 
prior familiarity with linear regression analysis. 

The essential estimation theory is presented in three chapters. Chapter 4 begins with 
the linear regression model. It then covers at an introductory level quantile regression, 
which models distributional features other than the conditional mean. It provides a 
lengthy expository treatment of instrumental variables estimation, a major method of 
causal inference. Chapter 5 presents the most commonly-used estimation methods for 
nonlinear models, beginning with the topic of m-estimation, before specialization to 
maximum likelihood and nonlinear least squares regression. Chapter 6 provides a com- 
prehensive treatment of generalized method of moments, which is a quite general esti- 
mation framework that is applicable for linear and nonlinear models in single-equation 
and multi-equation settings. The chapter emphasizes the special case of instrumental 
variables estimation. 

We then turn to model testing. Chapter 7 covers both the classical and bootstrap 
approaches to hypothesis testing, while Chapter 8 presents relatively more modern 
methods of model selection and specification analysis. Because of their importance 
the computationally-intensive bootstrap methods are also the subject of a more de- 
tailed chapter, Chapter 11 in Part 3. A distinctive feature of this book is that, as much 
as possible, testing procedures are presented in a unified manner in just these three 
chapters. The procedures are then illustrated in specific applications throughout the 
book. 

Chapter 9 is a stand-alone chapter that presents nonparametric and semiparametric 
estimation methods that place a flexible structure on the econometric model. 
Chapter 10 presents the computational methods used to compute the nonlinear esti-
mators presented in Chapters 5 and 6. This material becomes especially relevant to the
practitioner if an estimator is not automatically computed by an econometrics package, 
or if numerical difficulties are encountered in model estimation. 
CHAPTER 4
Linear Models 


4.1. Introduction 


A great deal of empirical microeconometrics research uses linear regression and its 
various extensions. Before moving to nonlinear models, the emphasis of this book, 
we provide a summary of some important results for the single-equation linear regres- 
sion model with cross-section data. Several different estimators in the linear regression 
model are presented. 

Ordinary least-squares (OLS) estimation is especially popular. For typical microe- 
conometric cross-section data the model error terms are likely to be heteroskedas- 
tic. Then statistical inference should be robust to heteroskedastic errors and efficiency 
gains are possible by use of weighted rather than ordinary least squares. 

The OLS estimator minimizes the sum of squared residuals. One alternative is to 
minimize the sum of the absolute value of residuals, leading to the least absolute de- 
viations estimator. This estimator is also presented, along with extension to quantile 
regression. 

Various model misspecifications can lead to inconsistency of least-squares estima- 
tors. In such cases inference about economically interesting parameters may require 
more advanced procedures and these are pursued at considerable length and depth else- 
where in the book. One commonly used procedure is instrumental variables regression. 
The current chapter provides an introductory treatment of this important method and 
additionally addresses the complication of weak instruments. 

Section 4.2 provides a definition of regression and presents various loss functions 
that lead to different estimators for the regression function. An example is introduced 
in Section 4.3. Some leading estimation procedures, specifically ordinary least squares, 
weighted least squares, and quantile regression, are presented in, respectively, Sec- 
tions 4.4, 4.5, and 4.6. Model misspecification is considered in Section 4.7. Instru- 
mental variables regression is presented in Sections 4.8 and 4.9. Sections 4.3-4.5, 4.7, 
and 4.8 cover standard material in introductory courses, whereas Sections 4.2, 4.6, and 
4.9 introduce more advanced material. 
4.2. Regressions and Loss Functions


In modern microeconometrics the term regression refers to a bewildering range of 
procedures for studying the relationship between an outcome variable y and a set of 
regressors x. It is helpful, therefore, to state at the beginning the motivation and justi- 
fication for some of the leading types of regressions. 

For exposition it is convenient to think of the purpose of regression to be condi- 
tional prediction of y given x. In practice, regression models are also used for other 
purposes, most notably causal inference. Even then a prediction function constitutes a 
useful data summary and is still of interest. In particular, see Section 4.2.3 for the dis- 
tinction between linear prediction and causal inference based on a linear causal mean. 


4.2.1. Loss Functions 


Let $\hat{y}$ denote the predictor defined as a function of x. Let $e \equiv y - \hat{y}$ denote the
prediction error, and let


$$L(e) = L(y - \hat{y}) \qquad (4.1)$$


denote the loss associated with the error e. As in decision analysis we assume that the 
predictor forms the basis of some decision, and the prediction error leads to disutility 
on the part of the decision maker that is captured by L(e), whose precise functional 
form is a choice of the decision maker. The loss function has the property that it is 
increasing in |e|. 

Treating $(y, \hat{y})$ as random, the decision maker minimizes the expected value of the
loss function, denoted E[L(e)]. If the predictor depends on x, a K-dimensional vector,
then expected loss is expressed as


$$E[L((y - \hat{y})|x)]. \qquad (4.2)$$


The choice of the loss function should depend in a substantive way on the losses 
associated with prediction errors. In some situations, such as weather forecasting, there 
may be a sound basis for choosing one loss function over another. 

In econometrics, there is often no clear guide and the convention is to specify 
quadratic loss. Then (4.1) specializes to $L(e) = e^2$ and by (4.2) the optimal predictor
minimizes the expected loss $E[L(e|x)] = E[e^2|x]$. It follows that in this case the
minimum mean-squared prediction error criterion is used to compare predictors. 


4.2.2. Optimal Prediction 


The decision theory approach to choosing the optimal predictor is framed in terms of 
minimizing expected loss, 


$$\min_{\hat{y}} E[L((y - \hat{y})|x)].$$


Thus the optimality property is relative to the loss function of the decision maker. 
Table 4.1. Loss Functions and Corresponding Optimal Predictors


Type of Loss Function      Definition                                                     Optimal Predictor
Squared error loss         $L(e) = e^2$                                                   $E[y|x]$
Absolute error loss        $L(e) = |e|$                                                   $\mathrm{med}[y|x]$
Asymmetric absolute loss   $L(e) = (1-\alpha)|e|$ if $e < 0$; $\alpha|e|$ if $e \geq 0$   $q_\alpha[y|x]$
Step loss                  $L(e) = 0$ if $e < 0$; $1$ if $e \geq 0$                       $\mathrm{mod}[y|x]$


Four leading examples of loss function, and the associated optimal predictor func- 
tion, are given in Table 4.1. We provide a brief presentation for each in turn. A detailed 
analysis is given in Manski (1988a). 

The most well known loss function is the squared error loss (or mean-square loss) 
function. Then the optimal predictor of y is the conditional mean function, E[y|x]. In 
the most general case no structure is placed on E[y|x] and estimation is by nonpara- 
metric regression (see Chapter 9). More often a model for E[y|x] is specified, with 
$E[y|x] = g(x, \beta)$, where $g(\cdot)$ is a specified function and $\beta$ is a finite-dimensional vector
of parameters that needs to be estimated. The optimal prediction is $\hat{y} = g(x, \hat{\beta})$,
where $\hat{\beta}$ is chosen to minimize the in-sample loss

$$\sum_{i=1}^{N} L(e_i) = \sum_{i=1}^{N} e_i^2 = \sum_{i=1}^{N} (y_i - g(x_i, \beta))^2.$$

The loss function is the sum of squared residuals, so estimation is by nonlinear least
squares (see Section 5.8). If the conditional mean function $g(\cdot)$ is restricted to be linear
in x and $\beta$, so that $E[y|x] = x'\beta$, then the optimal predictor is $\hat{y} = x'\hat{\beta}$, where $\hat{\beta}$ is the
ordinary least-squares estimator detailed in Section 4.4. 

If the loss criterion is absolute error loss, then the optimal predictor is the conditional
median, denoted med[y|x]. If the conditional median function is linear, so
that $\mathrm{med}[y|x] = x'\beta$, then the optimal predictor is $\hat{y} = x'\hat{\beta}$, where $\hat{\beta}$ is the least absolute
deviations estimator that minimizes $\sum_i |y_i - x_i'\beta|$. This estimator is presented in
Section 4.6. 

Both the squared error and absolute error loss functions are symmetric, so the same 
penalty is imposed for prediction error of a given magnitude regardless of the direc- 
tion of the prediction error. Asymmetric absolute error loss instead places a penalty 
of $(1 - \alpha)|e|$ on overprediction and a different penalty $\alpha|e|$ on underprediction. The
asymmetry parameter $\alpha$ is specified. It lies in the interval (0, 1) with symmetry when
$\alpha = 0.5$ and increasing asymmetry as $\alpha$ approaches 0 or 1. The optimal predictor can
be shown to be the conditional quantile, denoted $q_\alpha[y|x]$; a special case is the conditional
median when $\alpha = 0.5$. Conditional quantiles are defined in Section 4.6, which
presents quantile regression (Koenker and Bassett, 1978). 

The last loss function given in Table 4.1 is step loss, which bases the loss simply on 
the sign of the prediction error regardless of the magnitude. The optimal predictor is the 
conditional mode, denoted mod[y|x]. This provides motivation for mode regression
(Lee, 1989). 
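
The correspondence between loss functions and optimal predictors in the first three rows of Table 4.1 can be verified numerically in the simplest case with no regressors, where the predictor is a single constant. The sketch below is an illustration under that assumption: the five-point sample, the grid of candidate predictors, and the function names are hypothetical. It searches the grid for the constant that minimizes the in-sample loss.

```python
def squared_loss(e):
    """Squared error loss L(e) = e^2."""
    return e ** 2


def absolute_loss(e):
    """Absolute error loss L(e) = |e|."""
    return abs(e)


def asymmetric_loss(alpha):
    """Asymmetric absolute loss: (1 - alpha)|e| if e < 0, alpha|e| if e >= 0."""
    def loss(e):
        return (1 - alpha) * abs(e) if e < 0 else alpha * abs(e)
    return loss


def best_constant_predictor(y, loss, candidates):
    """Candidate c that minimizes the in-sample loss sum_i L(y_i - c)."""
    return min(candidates, key=lambda c: sum(loss(yi - c) for yi in y))


if __name__ == "__main__":
    y = [1, 2, 3, 4, 10]                  # sample mean 4, sample median 3
    grid = [i * 0.5 for i in range(21)]   # candidates 0.0, 0.5, ..., 10.0
    print("squared loss:    ", best_constant_predictor(y, squared_loss, grid))
    print("absolute loss:   ", best_constant_predictor(y, absolute_loss, grid))
    print("asymmetric, 0.75:", best_constant_predictor(y, asymmetric_loss(0.75), grid))
    print("asymmetric, 0.25:", best_constant_predictor(y, asymmetric_loss(0.25), grid))
```

For this sample the squared-loss minimizer coincides with the sample mean and the absolute-loss minimizer with the sample median, while the asymmetric losses pick out upper and lower order statistics, in line with the quantile interpretation. Step loss is omitted from the sketch.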

Maximum likelihood does not fall as easily into the prediction framework of this 
section. It can, however, be given an expected loss interpretation in terms of predicting 
the density and minimizing Kullback-Leibler information (see Section 5.7).

The results just stated imply that the econometrician interested in estimating a pre- 
diction function from the data (y, x) should choose the prediction function according 
to the loss function. The use of the popular linear regression implies, at least implicitly, 
that the decision maker has a quadratic loss function and believes that the conditional 
mean function is linear. However, if one of the other three loss functions is specified, 
then the optimal predictor will be based on one of the three other types of regressions. 
In practice there can be no clear reason for preferring a particular loss function. 

Regressions are often used as data summaries, rather than for prediction per se. 
Then it can be useful to consider a range of estimators, as alternative estimators may 
provide useful information about the sensitivity of estimates. Manski (1988a, 1991) 
has pointed out that the quadratic and absolute error loss functions are both convex. If 
the conditional distribution of y|x is symmetric then the conditional mean and median 
estimators are both consistent and can be expected to be quite close. Furthermore, if 
one avoids assumptions about the distribution of y|x, then differences in alternative 
estimators provide a way of learning about the data distribution. 


4.2.3. Linear Prediction 


The optimal predictor under squared error loss is the conditional mean E[y|x]. If this 
conditional mean is linear in x, so that $\mathrm{E}[y|\mathbf{x}] = \mathbf{x}'\boldsymbol{\beta}$, the parameter $\boldsymbol{\beta}$ has a structural
or causal interpretation and consistent estimation of $\boldsymbol{\beta}$ by OLS implies consistent
estimation of $\mathrm{E}[y|\mathbf{x}] = \mathbf{x}'\boldsymbol{\beta}$. This permits meaningful policy analysis of effects of changes
in regressors on the conditional mean. 

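If instead the conditional mean is nonlinear in x, so that $\mathrm{E}[y|\mathbf{x}] \neq \mathbf{x}'\boldsymbol{\beta}$, the structural
interpretation of OLS disappears. However, it is still possible to interpret $\boldsymbol{\beta}$ as the best
linear predictor under squared error loss. Differentiation of the expected loss
$\mathrm{E}[(y - \mathbf{x}'\boldsymbol{\beta})^2]$ with respect to $\boldsymbol{\beta}$ yields first-order conditions $-2\mathrm{E}[\mathbf{x}(y - \mathbf{x}'\boldsymbol{\beta})] = \mathbf{0}$, so the optimal
linear predictor is $\boldsymbol{\beta} = (\mathrm{E}[\mathbf{x}\mathbf{x}'])^{-1}\mathrm{E}[\mathbf{x}y]$ with sample analogue the OLS estimator.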

Usually we specialize to models with intercept. In a change of notation we define x
to denote regressors excluding the intercept, and we replace $\mathbf{x}'\boldsymbol{\beta}$ by $\alpha + \mathbf{x}'\boldsymbol{\gamma}$. The first-order
conditions with respect to $\alpha$ and $\boldsymbol{\gamma}$ are that $-2\mathrm{E}[u] = 0$ and $-2\mathrm{E}[\mathbf{x}u] = \mathbf{0}$, where
$u = y - (\alpha + \mathbf{x}'\boldsymbol{\gamma})$. These imply that E[u] = 0 and Cov[x,u] = 0. Solving yields


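$\boldsymbol{\gamma} = (\mathrm{V}[\mathbf{x}])^{-1}\,\mathrm{Cov}[\mathbf{x}, y],$ (4.3)
$\alpha = \mathrm{E}[y] - \mathrm{E}[\mathbf{x}']\boldsymbol{\gamma};$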


see, for example, Goldberger (1991, p. 52). 
From the derivation of (4.3) it should be clear that for data (y, x) we can always 
write a linear regression model 


$y = \alpha + \mathbf{x}'\boldsymbol{\gamma} + u$, (4.4)


where the parameters $\alpha$ and $\boldsymbol{\gamma}$ are defined in (4.3) and the error term u satisfies E[u] =
0 and Cov[x,u] = 0.

A linear regression model can therefore always be given the nonstructural or re- 
duced form interpretation as the best linear prediction (or linear projection) un- 
der squared error loss. However, for the conditional mean to be linear in x, so that 
$\mathrm{E}[y|\mathbf{x}] = \alpha + \mathbf{x}'\boldsymbol{\gamma}$, requires the assumption that E[u|x] = 0, in addition to E[u] = 0 and
Cov[x,u] = 0. 
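
The distinction between linear projection and linear conditional mean can be illustrated with simulated data. The following is a minimal sketch under an assumed dgp (a single regressor with $\mathrm{E}[y|x] = x^2 + x$, chosen only for illustration). It computes $\alpha$ and $\gamma$ from the sample analogue of (4.3), confirms that they coincide with OLS including an intercept and that the projection error has zero mean and zero covariance with x, and then shows that the mean of u still varies with x.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(size=n)
y = x**2 + x + rng.normal(size=n)      # assumed dgp: E[y|x] = x^2 + x

# Sample analogue of (4.3) for a single regressor.
gamma = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
alpha = y.mean() - x.mean() * gamma
u = y - (alpha + gamma * x)            # linear projection error

# OLS of y on an intercept and x gives the same numbers.
X = np.column_stack([np.ones(n), x])
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(alpha, gamma, b_ols)

# E[u] = 0 and Cov[x, u] = 0 hold by construction ...
print(u.mean(), np.cov(x, u)[0, 1])

# ... but E[u|x] = 0 fails: the average error depends on where x lies.
inner = np.abs(x) < 0.5
outer = np.abs(x) > 1.5
print(u[inner].mean(), u[outer].mean())
```

Under this dgp the average error is negative for x near zero and positive in the tails, so (4.4) holds as a best linear predictor even though the conditional mean is not linear.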

This distinction is of practical importance. For example, if E[u|x] = 0, so that 
$\mathrm{E}[y|\mathbf{x}] = \alpha + \mathbf{x}'\boldsymbol{\gamma}$, then the probability limit of a least-squares (LS) estimator $\hat{\boldsymbol{\gamma}}$ is $\boldsymbol{\gamma}$
regardless of whether the LS estimator is weighted or unweighted, or whether the
sample is obtained by simple random sampling or by exogenous stratified sampling. If
instead $\mathrm{E}[y|\mathbf{x}] \neq \alpha + \mathbf{x}'\boldsymbol{\gamma}$ then these different LS estimators may have different
probability limits. This example is discussed further in Section 24.3.
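
A small simulation makes this point concrete. The sketch below is an assumption-laden illustration, not a result from the text: the uniform regressor, the two dgps, and the weights (which depend only on x, mimicking exogenous stratification) are hypothetical choices. It compares unweighted and weighted LS slopes when the conditional mean is linear and when it is not.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
x = rng.uniform(0.0, 2.0, size=n)
e = rng.normal(size=n)
w = 1.0 + 3.0 * (x > 1.0)      # hypothetical weights that depend only on x

def ls_slope(x, y, w=None):
    # Slope from (weighted) least squares of y on an intercept and x.
    X = np.column_stack([np.ones_like(x), x])
    if w is None:
        w = np.ones_like(x)
    XtWX = X.T @ (w[:, None] * X)
    XtWy = X.T @ (w * y)
    return np.linalg.solve(XtWX, XtWy)[1]

y_lin = 1.0 + 2.0 * x + e      # E[y|x] linear in x
y_non = 1.0 + x**3 + e         # E[y|x] nonlinear in x

slopes_lin = (ls_slope(x, y_lin), ls_slope(x, y_lin, w))
slopes_non = (ls_slope(x, y_non), ls_slope(x, y_non, w))
print(slopes_lin)   # both close to 2
print(slopes_non)   # the two slopes differ noticeably
```

With a linear conditional mean both estimators settle near the same value. With a nonlinear conditional mean each estimator recovers a different best linear predictor, because weighting changes which region of x dominates the fit.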

A structural interpretation of OLS requires that the conditional mean of the error 
term, given regressors, equals zero. 


4.3. Example: Returns to Schooling 


A leading linear regression application from labor economics concerns measuring the 
impact of education on wages or earnings. 
A typical returns to schooling model specifies 


$\ln w_i = \alpha s_i + \mathbf{x}_{2i}'\boldsymbol{\beta} + u_i, \quad i = 1, \ldots, N,$ (4.5)


where w denotes hourly wage or annual earnings, s denotes years of completed schooling,
and $\mathbf{x}_2$ denotes control variables such as work experience, gender, and family
background. The subscript i denotes the ith person in the sample. Since the dependent
variable is log wage, the model is a log-linear model and the coefficient $\alpha$ measures
the proportionate change in earnings associated with a one-year increase in education. 

Estimation of this model is most often by ordinary least squares. In practice the
transformation to ln w makes the errors approximately homoskedastic, but it
is still best to obtain heteroskedasticity-consistent standard errors as detailed in
Section 4.4. Estimation can also be by quantile regression (see Section 4.6), if interest
lies in distributional issues such as behavior in the lower quartile. 

The regression (4.5) can be used immediately in a descriptive manner. For example,
if $\hat{\alpha} = 0.10$ then a one-year increase in schooling is associated with 10% higher
earnings, controlling for all the factors included in $\mathbf{x}_2$. It is important to add the last
qualifier as in this example the estimate $\hat{\alpha}$ usually becomes smaller as $\mathbf{x}_2$ is expanded
to include additional controls likely to influence earnings.
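
The sensitivity of $\hat{\alpha}$ to the set of controls can be sketched with simulated data. Everything in the block below is hypothetical: the variable `ability`, the coefficient values, and the dgp are assumptions chosen to mimic a control that raises both schooling and earnings, and they are not estimates from any real data set. The block compares the schooling coefficient with and without the control.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
ability = rng.normal(size=n)                    # hypothetical control in x_2
s = 12.0 + 2.0 * ability + rng.normal(size=n)   # schooling rises with ability
ln_w = 1.0 + 0.08 * s + 0.10 * ability + 0.3 * rng.normal(size=n)

def ols(X, y):
    # Least-squares coefficients of y on the columns of X.
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
a_short = ols(np.column_stack([ones, s]), ln_w)[1]           # no control
a_long = ols(np.column_stack([ones, s, ability]), ln_w)[1]   # with control
print(a_short, a_long)
```

In this simulated setting the coefficient falls from about 0.12 to about 0.08 once the control is included, which is the pattern described above. Whether real data behave this way depends on how the omitted controls relate to schooling.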

Policy interest lies in determining the impact of an exogenous change in schooling 
on earnings. However, schooling is not randomly assigned; rather, it is an outcome that 
depends on choices made by the individual. Human capital theory treats schooling as 
investment by individuals in themselves, and $\alpha$ is interpreted as a measure of return to
human capital. The regression (4.5) is then a regression of one endogenous variable,
y (here ln w), on another, s, and so does not measure the causal impact of an exogenous change
in s. The conditional mean function here is not causally meaningful because one is
conditioning on a factor, schooling, that is endogenous. Indeed, unless we can argue
that s is itself a function of variables at least one of which can vary independently of
u, it is unclear just what it means to regard $\alpha$ as a causal parameter.

Such concern about endogenous regressors with observational data on individuals 
pervades microeconometric analysis. The standard assumptions of the linear regres- 
sion model given in Section 4.4 are that regressors are exogenous. The consequences 
of endogenous regressors are considered in Section 4.7. One method to control for 
endogenous regressors, instrumental variables, is detailed in Section 4.8. A recent
extensive review of ways to control for endogeneity in this wage-schooling example is
given in Angrist and Krueger (1999). These methods are summarized in Section 2.8 
and presented throughout this book. 


4.4. Ordinary Least Squares 


The simplest example of regression is the OLS estimator in the linear regression model. 

After first defining the model and estimator, a quite detailed presentation of the 
asymptotic distribution of the OLS estimator is given. The exposition presumes pre- 
vious exposure to a more introductory treatment. The model assumptions made here 
permit stochastic regressors and heteroskedastic errors and accommodate data that are 
obtained by exogenous stratified sampling. 

The key result of how to obtain heteroskedastic-robust standard errors of the OLS 
estimator is given in Section 4.4.5. 


4.4.1. Linear Regression Model 


In a standard cross-section regression model with N observations on a scalar 
dependent variable and several regressors, the data are specified as (y, X), where y 
denotes observations on the dependent variable and X denotes a matrix of explanatory 
variables. 

The general regression model with additive errors is written in vector notation as 


y = E[y|X] + u, (4.6) 


where E[y|X] denotes the conditional expectation of the random variable y given X, 
and u denotes a vector of unobserved random errors or disturbances. The right-hand 
side of this equation decomposes y into two components, one that is deterministic 
given the regressors and one that is attributed to random variation or noise. We think 
of E[y|X] as a conditional prediction function that yields the average value, or more 
formally the expected value, of y given X. 

A linear regression model is obtained when E[y|X] is specified to be a linear func- 
tion of X. Notation for this model has been presented in detail in Section 1.6. In vector 
notation the ith observation is 


$y_i = \mathbf{x}_i'\boldsymbol{\beta} + u_i$, (4.7)


where $\mathbf{x}_i$ is a $K \times 1$ regressor vector and $\boldsymbol{\beta}$ is a $K \times 1$ parameter vector. At times
it is simpler to drop the subscript i and write the model for typical observation as
$y = \mathbf{x}'\boldsymbol{\beta} + u$. In matrix notation the N observations are stacked by row to yield


$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$, (4.8)


where y is an N x 1 vector of dependent variables, X is an N x K regression ma- 
trix, and u is an N x 1 error vector. 

Equations (4.7) and (4.8) are equivalent expressions for the linear regression model 
and will be used interchangeably. The latter is more concise and is usually the most 
convenient representation. 

In this setting y is referred to as the dependent variable or endogenous variable 
whose variation we wish to study in terms of variation in x and u; u is referred to as 
the error term or disturbance term; and x is referred to as regressors or predictors 
or covariates. If Assumption 4 in Section 4.4.6 holds, then all components of x are
exogenous variables or independent variables. 


4.4.2. OLS Estimator 


The OLS estimator is defined to be the estimator that minimizes the sum of squared 
errors 


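$\sum_{i=1}^{N} u_i^2 = \mathbf{u}'\mathbf{u} = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$. (4.9)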


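Setting the derivative with respect to $\boldsymbol{\beta}$ equal to 0 and solving for $\boldsymbol{\beta}$ yields the OLS
estimator,


$\hat{\boldsymbol{\beta}}_{\mathrm{OLS}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$, (4.10)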


see Exercise 4.5 for a more general result, where it is assumed that the matrix inverse of 
X’X exists. If X’X is of less than full rank, the inverse can be replaced by a generalized 
inverse. Then OLS estimation still yields the optimal linear predictor of y given x if 
squared error loss is used, but many different linear combinations of x will yield this 
optimal predictor. 
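
The computation in (4.10) and the rank-deficient case can be sketched in a few lines of NumPy. The simulated design, the coefficient values, and the variable names are assumptions for illustration. The block solves the normal equations, checks the first-order conditions, and then duplicates a regressor so that X'X is singular. In that case `numpy.linalg.lstsq` returns the minimum-norm least-squares solution, which corresponds to using the Moore-Penrose generalized inverse.

```python
import numpy as np

rng = np.random.default_rng(4)
N, K = 500, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=N)   # hypothetical dgp

# (4.10): solve the normal equations X'X b = X'y instead of forming the inverse.
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ b_ols
print(X.T @ resid)             # first-order conditions: numerically zero

# Rank-deficient case: append an exact copy of the second column.
X_def = np.column_stack([X, X[:, 1]])
b_min_norm, _, rank_def, _ = np.linalg.lstsq(X_def, y, rcond=None)
b_alt = np.append(b_ols, 0.0)  # another coefficient vector with the same fit

print(rank_def)                                        # rank of X_def
print(np.allclose(X_def @ b_min_norm, X_def @ b_alt))  # same fitted values
print(b_min_norm, b_alt)                               # different coefficients
```

The fitted values, and hence the optimal linear predictor, are the same for both coefficient vectors, while the coefficients themselves are not unique.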


4.4.3. Identification 


The OLS estimator can always be computed, provided that X’X is nonsingular. The 
more interesting issue is what $\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}$ tells us about the data.

We focus on the ability of the OLS estimator to permit identification (see Section 
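2.5) of the conditional mean E[y|X]. For the linear model the parameter $\boldsymbol{\beta}$ is identified
if


1. $\mathrm{E}[\mathbf{y}|\mathbf{X}] = \mathbf{X}\boldsymbol{\beta}$ and
2. $\mathbf{X}\boldsymbol{\beta}^{(1)} = \mathbf{X}\boldsymbol{\beta}^{(2)}$ if and only if $\boldsymbol{\beta}^{(1)} = \boldsymbol{\beta}^{(2)}$.

The first condition that the conditional mean is correctly specified ensures that $\boldsymbol{\beta}$ is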
of intrinsic interest; the second assumption implies that X’X is nonsingular, which is 
the same condition needed to compute the unique OLS estimate (4.10). 


4.4.4. Distribution of the OLS Estimator 


We focus on the asymptotic properties of the OLS estimator. Consistency is estab- 
lished and then the limit distribution is obtained by rescaling the OLS estimator. 
Statistical inference then requires consistent estimation of the variance matrix of the 
estimator. The analysis makes extensive use of asymptotic theory, which is summa- 
rized in Appendix A. 


Consistency 


The properties of an estimator depend on the process that actually generated the data, 
the data generating process (dgp). We assume the dgp is $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$, so that the
model (4.8) is correctly specified. In some places, notably Chapters 5 and 6 and
Appendix A, the subscript 0 is added to $\boldsymbol{\beta}$, so the dgp is $\mathbf{y} = \mathbf{X}\boldsymbol{\beta}_0 + \mathbf{u}$. See Section 5.2.3
for discussion. 

Then 


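$\hat{\boldsymbol{\beta}}_{\mathrm{OLS}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$
$= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{X}\boldsymbol{\beta} + \mathbf{u})$
$= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}$,


and the OLS estimator can be expressed as

$\hat{\boldsymbol{\beta}}_{\mathrm{OLS}} = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}$. (4.11)

To prove consistency we rewrite (4.11) as

$\hat{\boldsymbol{\beta}}_{\mathrm{OLS}} = \boldsymbol{\beta} + (N^{-1}\mathbf{X}'\mathbf{X})^{-1} N^{-1}\mathbf{X}'\mathbf{u}$. (4.12)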


The reason for renormalization in the right-hand side is that $N^{-1}\mathbf{X}'\mathbf{X} = N^{-1}\sum_i \mathbf{x}_i\mathbf{x}_i'$
is an average that converges in probability to a finite nonzero matrix if $\mathbf{x}_i$ satisfies
assumptions that permit a law of large numbers to be applied to $\mathbf{x}_i\mathbf{x}_i'$ (see Section 4.4.8
for detail). Then


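$\operatorname{plim} \hat{\boldsymbol{\beta}}_{\mathrm{OLS}} = \boldsymbol{\beta} + (\operatorname{plim} N^{-1}\mathbf{X}'\mathbf{X})^{-1}(\operatorname{plim} N^{-1}\mathbf{X}'\mathbf{u})$,


using Slutsky's Theorem (Theorem A.3). The OLS estimator is consistent for $\boldsymbol{\beta}$ (i.e.,
$\operatorname{plim} \hat{\boldsymbol{\beta}}_{\mathrm{OLS}} = \boldsymbol{\beta}$) if


$\operatorname{plim} N^{-1}\mathbf{X}'\mathbf{u} = \mathbf{0}$. (4.13)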


If a law of large numbers can be applied to the average $N^{-1}\mathbf{X}'\mathbf{u} = N^{-1}\sum_i \mathbf{x}_i u_i$ then
a necessary condition for (4.13) to hold is that $\mathrm{E}[\mathbf{x}_i u_i] = \mathbf{0}$.


Limit Distribution


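Given consistency, the limit distribution of $\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}$ is degenerate with all the mass at $\boldsymbol{\beta}$.
To obtain the limit distribution we multiply $\hat{\boldsymbol{\beta}}_{\mathrm{OLS}} - \boldsymbol{\beta}$ by $\sqrt{N}$, as this rescaling leads to a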
random variable that under standard cross-section assumptions has nonzero yet finite 
variance asymptotically. Then (4.11) becomes 


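$\sqrt{N}(\hat{\boldsymbol{\beta}}_{\mathrm{OLS}} - \boldsymbol{\beta}) = (N^{-1}\mathbf{X}'\mathbf{X})^{-1} N^{-1/2}\mathbf{X}'\mathbf{u}$. (4.14)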


The proof of consistency assumed that $\operatorname{plim} N^{-1}\mathbf{X}'\mathbf{X}$ exists and is finite and nonzero.
We assume that a central limit theorem can be applied to $N^{-1/2}\mathbf{X}'\mathbf{u}$ to yield a multivariate
normal limit distribution with finite, nonsingular covariance matrix. Applying
the product rule for limit normal distributions (Theorem A.17) implies that the product 
in the right-hand side of (4.14) has a limit normal distribution. Details are provided in 
Section 4.4.8. 

This leads to the following proposition, which permits regressors to be stochastic 
and does not restrict model errors to be homoskedastic and uncorrelated. 


Proposition 4.1 (Distribution of OLS Estimator). Make the following assump- 
tions: 

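(i) The dgp is model (4.8), that is, $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$.

(ii) Data are independent over i with $\mathrm{E}[\mathbf{u}|\mathbf{X}] = \mathbf{0}$ and $\mathrm{E}[\mathbf{u}\mathbf{u}'|\mathbf{X}] = \boldsymbol{\Omega} = \mathrm{Diag}[\sigma_i^2]$.

(iii) The matrix $\mathbf{X}$ is of full rank so that $\mathbf{X}\boldsymbol{\beta}^{(1)} = \mathbf{X}\boldsymbol{\beta}^{(2)}$ iff $\boldsymbol{\beta}^{(1)} = \boldsymbol{\beta}^{(2)}$.

(iv) The $K \times K$ matrix


$\mathbf{M}_{xx} = \operatorname{plim} N^{-1}\mathbf{X}'\mathbf{X} = \operatorname{plim} \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i\mathbf{x}_i' = \lim \frac{1}{N}\sum_{i=1}^{N} \mathrm{E}[\mathbf{x}_i\mathbf{x}_i']$ (4.15)


exists and is finite nonsingular.

(v) The $K \times 1$ vector $N^{-1/2}\mathbf{X}'\mathbf{u} = N^{-1/2}\sum_{i=1}^{N} \mathbf{x}_i u_i \xrightarrow{d} \mathcal{N}[\mathbf{0}, \mathbf{M}_{x\Omega x}]$, where


$\mathbf{M}_{x\Omega x} = \operatorname{plim} N^{-1}\mathbf{X}'\mathbf{u}\mathbf{u}'\mathbf{X} = \operatorname{plim} \frac{1}{N}\sum_{i=1}^{N} u_i^2\mathbf{x}_i\mathbf{x}_i' = \lim \frac{1}{N}\sum_{i=1}^{N} \mathrm{E}[u_i^2\mathbf{x}_i\mathbf{x}_i']$. (4.16)


Then the OLS estimator $\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}$ defined in (4.10) is consistent for $\boldsymbol{\beta}$ and


$\sqrt{N}(\hat{\boldsymbol{\beta}}_{\mathrm{OLS}} - \boldsymbol{\beta}) \xrightarrow{d} \mathcal{N}[\mathbf{0}, \mathbf{M}_{xx}^{-1}\mathbf{M}_{x\Omega x}\mathbf{M}_{xx}^{-1}]$. (4.17)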


Assumption (i) is used to obtain (4.11). Assumption (ii) ensures $\mathrm{E}[\mathbf{y}|\mathbf{X}] = \mathbf{X}\boldsymbol{\beta}$ and
permits heteroskedastic errors with variance $\sigma_i^2$, more general than the homoskedastic
uncorrelated errors that restrict $\boldsymbol{\Omega} = \sigma^2\mathbf{I}$. Assumption (iii) rules out perfect collinearity
among the regressors. Assumption (iv) leads to the rescaling of $\mathbf{X}'\mathbf{X}$ by $N^{-1}$ in
(4.12) and (4.14). Note that by a law of large numbers plim = lim E (see Appendix
Section A.3).

The essential condition for consistency is (4.13). Rather than directly assume this 
we have used the stronger assumption (v) which is needed to obtain result (4.17).
Given that $N^{-1/2}\mathbf{X}'\mathbf{u}$ has a limit distribution with zero mean and finite variance,
multiplication by $N^{-1/2}$ yields a random variable that converges in probability to zero and
so (4.13) holds as desired. Assumption (v) is required, along with assumption (iv), to
obtain the limit normal result (4.17), which by Theorem A.17 then follows immediately
from (4.14). More primitive assumptions on $u_i$ and $\mathbf{x}_i$ that ensure (iv) and (v) are
satisfied are given in Section 4.4.6, with formal proof in Section 4.4.8. 
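
Proposition 4.1 can be illustrated by Monte Carlo. The sketch below assumes a particular dgp that is not in the text: a standard normal regressor, heteroskedastic errors with $\mathrm{V}[u|x] = x^2$, and arbitrary sample sizes and replication count. For this dgp $\mathbf{M}_{xx} = \mathbf{I}$ and the slope element of $\mathbf{M}_{x\Omega x}$ is $\mathrm{E}[x^4] = 3$, so (4.17) implies that the rescaled slope has limit standard deviation $\sqrt{3}$.

```python
import numpy as np

rng = np.random.default_rng(5)
beta = np.array([1.0, 2.0])                 # hypothetical true parameters

def draw_slope(N):
    # One sample of size N from the assumed dgp; returns the OLS slope.
    x = rng.normal(size=N)
    u = np.abs(x) * rng.normal(size=N)      # E[u|x] = 0, V[u|x] = x^2
    X = np.column_stack([np.ones(N), x])
    y = X @ beta + u
    return np.linalg.solve(X.T @ X, X.T @ y)[1]

def simulate(N, reps=2000):
    b = np.array([draw_slope(N) for _ in range(reps)])
    sd_b = b.std()                                      # shrinks with N
    sd_scaled = (np.sqrt(N) * (b - beta[1])).std()      # stabilizes
    return sd_b, sd_scaled

results = {N: simulate(N) for N in (100, 400, 1600)}
for N, (sd_b, sd_scaled) in results.items():
    print(N, round(sd_b, 4), round(sd_scaled, 3))
print(np.sqrt(3.0))   # limit standard deviation implied by (4.17)
```

The first column of output shrinks toward zero as N grows, reflecting consistency, while the rescaled quantity settles near $\sqrt{3}$, reflecting the nondegenerate limit distribution.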


Asymptotic Distribution 


Proposition 4.1 gives the limit distribution of $\sqrt{N}(\hat{\boldsymbol{\beta}}_{\mathrm{OLS}} - \boldsymbol{\beta})$, a rescaling of $\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}$.
Many practitioners prefer to see asymptotic results written directly in terms of the
distribution of $\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}$, in which case the distribution is called an asymptotic distribution.
This asymptotic distribution is interpreted as being applicable in large samples, meaning
samples large enough for the limit distribution to be a good approximation but not
so large that $\hat{\boldsymbol{\beta}}_{\mathrm{OLS}} \xrightarrow{p} \boldsymbol{\beta}$ as then its asymptotic distribution would be degenerate. The
discussion mirrors that in Appendix A.6.4. 

The asymptotic distribution is obtained from (4.17) by division by $\sqrt{N}$ and addition
of $\boldsymbol{\beta}$. This yields the asymptotic distribution


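$\hat{\boldsymbol{\beta}}_{\mathrm{OLS}} \stackrel{a}{\sim} \mathcal{N}[\boldsymbol{\beta}, N^{-1}\mathbf{M}_{xx}^{-1}\mathbf{M}_{x\Omega x}\mathbf{M}_{xx}^{-1}]$, (4.18)


where the symbol $\stackrel{a}{\sim}$ means "is asymptotically distributed as." The variance matrix
in (4.18) is called the asymptotic variance matrix of $\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}$ and is denoted $\mathrm{V}[\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}]$.
Even simpler notation drops the limits and expectations in the definitions of $\mathbf{M}_{xx}$ and
$\mathbf{M}_{x\Omega x}$ and the asymptotic distribution is denoted


$\hat{\boldsymbol{\beta}}_{\mathrm{OLS}} \stackrel{a}{\sim} \mathcal{N}[\boldsymbol{\beta}, (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}]$, (4.19)


and $\mathrm{V}[\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}]$ is defined to be the variance matrix in (4.19).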

We use both (4.18) and (4.19) to represent the asymptotic distribution in later chap- 
ters. Their use is for convenience of presentation. Formal asymptotic results for statisti- 
cal inference are based on the limit distribution rather than the asymptotic distribution. 

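For implementation, the matrices $\mathbf{M}_{xx}$ and $\mathbf{M}_{x\Omega x}$ in (4.17) or (4.18) are replaced by
consistent estimates $\hat{\mathbf{M}}_{xx}$ and $\hat{\mathbf{M}}_{x\Omega x}$. Then the estimated asymptotic variance matrix
of $\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}$ is

$\hat{\mathrm{V}}[\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}] = N^{-1}\hat{\mathbf{M}}_{xx}^{-1}\hat{\mathbf{M}}_{x\Omega x}\hat{\mathbf{M}}_{xx}^{-1}$. (4.20)


This estimate is called a sandwich estimate, with $\hat{\mathbf{M}}_{x\Omega x}$ sandwiched between $\hat{\mathbf{M}}_{xx}^{-1}$
and $\hat{\mathbf{M}}_{xx}^{-1}$.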


4.4.5. Heteroskedasticity-Robust Standard Errors for OLS 


The obvious choice for $\hat{\mathbf{M}}_{xx}$ in (4.20) is $N^{-1}\mathbf{X}'\mathbf{X}$. Estimation of $\mathbf{M}_{x\Omega x}$ defined in (4.16)
depends on assumptions made about the error term.

In microeconometrics applications the model errors are often conditionally
heteroskedastic, with $\mathrm{V}[u_i|\mathbf{x}_i] = \mathrm{E}[u_i^2|\mathbf{x}_i] = \sigma_i^2$ varying over i. White (1980a) proposed
using $\hat{\mathbf{M}}_{x\Omega x} = N^{-1}\sum_i \hat{u}_i^2\mathbf{x}_i\mathbf{x}_i'$. This estimate requires additional assumptions given in
Section 4.4.8.

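Combining these estimates $\hat{\mathbf{M}}_{xx}$ and $\hat{\mathbf{M}}_{x\Omega x}$ and simplifying yields the estimated
asymptotic variance matrix


$\hat{\mathrm{V}}[\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}] = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\hat{\boldsymbol{\Omega}}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}$ (4.21)
$= \left(\sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\sum_{i=1}^{N}\hat{u}_i^2\mathbf{x}_i\mathbf{x}_i'\left(\sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i'\right)^{-1}$,


where $\hat{\boldsymbol{\Omega}} = \mathrm{Diag}[\hat{u}_i^2]$ and $\hat{u}_i = y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}}$ is the OLS residual. This estimate, due to
White (1980a), is called the heteroskedastic-consistent estimate of the asymptotic
variance matrix of the OLS estimator, and it leads to standard errors that are called
heteroskedasticity-robust standard errors, or even more simply robust standard
errors. It provides a consistent estimate of $\mathrm{V}[\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}]$ even though $\hat{u}_i^2$ is not consistent
for $\sigma_i^2$.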

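In introductory courses the errors are restricted to be homoskedastic. Then $\boldsymbol{\Omega} = \sigma^2\mathbf{I}$
so that $\mathbf{X}'\boldsymbol{\Omega}\mathbf{X} = \sigma^2\mathbf{X}'\mathbf{X}$ and hence $\mathbf{M}_{x\Omega x} = \sigma^2\mathbf{M}_{xx}$. The limit distribution variance matrix
in (4.17) simplifies to $\sigma^2\mathbf{M}_{xx}^{-1}$, and many computer packages instead use what is
sometimes called the default OLS variance estimate


$\hat{\mathrm{V}}[\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}] = s^2(\mathbf{X}'\mathbf{X})^{-1}$, (4.22)


where $s^2 = (N - K)^{-1}\sum_i \hat{u}_i^2$.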

Inference based on (4.22) rather than (4.21) is invalid, unless errors are ho- 
moskedastic and uncorrelated. In general the erroneous use of (4.22) when errors are 
heteroskedastic, as is often the case for cross-section data, can lead to either inflation 
or deflation of the true standard errors. 

In practice $\hat{\mathbf{M}}_{x\Omega x}$ is calculated using division by $(N - K)$, rather than by N, to be
consistent with the similar division in forming $s^2$ in the homoskedastic case. Then
$\hat{\mathrm{V}}[\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}]$ in (4.21) is multiplied by $N/(N - K)$. With heteroskedastic errors there is
no theoretical basis for this adjustment for degrees of freedom, but some simulation 
studies provide support (see MacKinnon and White, 1985, and Long and Ervin, 2000). 
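
The estimates (4.21) and (4.22) and the degrees-of-freedom adjustment are straightforward to compute directly. The following sketch uses simulated data under an assumed dgp with $\mathrm{V}[u|x] = x^2$; the dgp, sample size, and variable names are illustrative assumptions, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 5000
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
K = X.shape[1]
u = np.abs(x) * rng.normal(size=N)           # assumed dgp: V[u|x] = x^2
y = X @ np.array([1.0, 2.0]) + u

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
uhat = y - X @ b                             # OLS residuals

# (4.21): sandwich estimate with Omega-hat = Diag[uhat_i^2], computed
# without building the N x N diagonal matrix.
meat = X.T @ (uhat[:, None] ** 2 * X)        # sum_i uhat_i^2 x_i x_i'
V_white = XtX_inv @ meat @ XtX_inv
V_white_dof = V_white * N / (N - K)          # degrees-of-freedom adjustment

# (4.22): default estimate, valid only under homoskedasticity.
s2 = uhat @ uhat / (N - K)
V_default = s2 * XtX_inv

se_white = np.sqrt(np.diag(V_white))
se_white_dof = np.sqrt(np.diag(V_white_dof))
se_default = np.sqrt(np.diag(V_default))
print(se_white, se_white_dof, se_default)
```

Under this dgp the default standard error for the slope understates the robust one by a sizable margin. In other designs the default estimate can instead overstate it, in line with the remark above that either direction is possible.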

Microeconometric analysis uses robust standard errors wherever possible. Here the
standard errors are robust to heteroskedasticity. Guarding against other misspecifications may
also be warranted. In particular, when data are clustered the standard errors should 
additionally be robust to clustering; see Sections 21.2.3 and 24.5. 


4.4.6. Assumptions for Cross-Section Regression 


Proposition 4.1 is a quite generic theorem that relies on assumptions about $N^{-1}\mathbf{X}'\mathbf{X}$
and $N^{-1/2}\mathbf{X}'\mathbf{u}$. In practice these assumptions are verified by application of laws of
large numbers and central limit theorems to averages of $\mathbf{x}_i\mathbf{x}_i'$ and $\mathbf{x}_i u_i$. These in turn
require assumptions about how the observations $\mathbf{x}_i$ and errors $u_i$ are generated, and
consequently how $y_i$ defined in (4.7) is generated. The assumptions are referred to
collectively as assumptions regarding the data-generating process (dgp). A simple
pedagogical example is given in Exercise 4.4.

Our objective at this stage is to make assumptions that are appropriate in many
applied settings where cross-section data are used. The assumptions are those in White
(1980a), and include three important departures from those in introductory treatments. 
First, the regressors may be stochastic (Assumptions 1 and 3 that follow), so assump- 
tions on the error are made conditional on regressors. Second, the conditional variance 
of the error may vary across observations (Assumption 5). Third, the errors are not 
restricted to be normally distributed. 

Here are the assumptions: 


1. The data $(y_i, \mathbf{x}_i)$ are independent and not identically distributed (inid) over i.


2. The model is correctly specified so that
$y_i = \mathbf{x}_i'\boldsymbol{\beta} + u_i$.


3. The regressor vector $\mathbf{x}_i$ is possibly stochastic with finite second moment, additionally
$\mathrm{E}[|x_{ij}x_{ik}|^{1+\delta}] < \infty$ for all $j, k = 1, \ldots, K$ for some $\delta > 0$, and the matrix $\mathbf{M}_{xx}$ defined
in (4.15) exists and is a finite positive definite matrix of rank K. Also, $\mathbf{X}$ has rank K in
the sample being analyzed.


4. The errors have zero mean, conditional on regressors
$\mathrm{E}[u_i|\mathbf{x}_i] = 0$.


5. The errors are heteroskedastic, conditional on regressors, with


$\sigma_i^2 = \mathrm{E}[u_i^2|\mathbf{x}_i]$,
$\boldsymbol{\Omega} = \mathrm{E}[\mathbf{u}\mathbf{u}'|\mathbf{X}] = \mathrm{Diag}[\sigma_i^2]$, (4.23)
where $\boldsymbol{\Omega}$ is an $N \times N$ positive definite matrix. Also, for some $\delta > 0$, $\mathrm{E}[|u_i^2|^{1+\delta}] < \infty$.


6. The matrix $\mathbf{M}_{x\Omega x}$ defined in (4.16) exists and is a finite positive definite matrix of rank
K, where $\mathbf{M}_{x\Omega x} = \operatorname{plim} N^{-1}\sum_i u_i^2\mathbf{x}_i\mathbf{x}_i'$ given independence over i. Also, for some $\delta > 0$,
$\mathrm{E}[|u_i^2 x_{ij}x_{ik}|^{1+\delta}] < \infty$ for all $j, k = 1, \ldots, K$.


4.4.7. Remarks on Assumptions 


For completeness we provide a detailed discussion of each assumption, before proving 
the key results in the following section. 


Stratified Random Sampling 


Assumption 1 is one that is often implicitly made for cross-section data. Here we make 
it explicit. It restricts $(y_i, \mathbf{x}_i)$ to be independent over i, but permits the distribution to
differ over i. Many microeconometrics data sets come from stratified random sampling
(see Section 3.2). Then the population is partitioned into strata and random draws
are made within strata, but some strata are oversampled with the consequence that the
sampled $(y_i, \mathbf{x}_i)$ are inid rather than iid. If instead the data come from simple random
sampling then $(y_i, \mathbf{x}_i)$ are iid, a stronger assumption that is a special case of inid.
Many introductory treatments assume that regressors are fixed in repeated samples.
Then $(y_i, \mathbf{x}_i)$ are inid since only $y_i$ is random with a value that depends on the value of
$\mathbf{x}_i$. The fixed regressors assumption is rarely appropriate for microeconometrics data,
which are usually observational data. It is used instead for experimental data, where x
is the treatment level.

These different assumptions on the distribution of $(y_i, \mathbf{x}_i)$ affect the particular laws
of large numbers and central limit theorems used to obtain the asymptotic properties
of the OLS estimator. Note that even if $(y_i, \mathbf{x}_i)$ are iid, $y_i$ given $\mathbf{x}_i$ is not iid since, for
example, $\mathrm{E}[y_i|\mathbf{x}_i] = \mathbf{x}_i'\boldsymbol{\beta}$ varies with $\mathbf{x}_i$.

Assumption 1 rules out most time-series data since they are dependent over observations.
It will also be violated if the sampling scheme involves clustering of observations.
The OLS estimator can still be consistent in these cases, provided Assumptions
2-4 hold, but usually it has a variance matrix different from that presented in this
chapter.


Correctly Specified Model 


Assumption 2 seems very obvious as it is an essential ingredient in the derivation of 
the OLS estimator. It still needs to be made explicitly, however, since $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$
is a function of y and so its properties depend on y. 

If Assumption 2 holds then it is being assumed that the regression model is linear in 
x, rather than nonlinear, that there are no omitted variables in the regression, and that 
there is no measurement error in the regressors, as the regressors x used to calculate 
$\hat{\boldsymbol{\beta}}$ are the same regressors x that are in the dgp. Also, the parameters $\boldsymbol{\beta}$ are the same
across individuals, ruling out random parameter models. 

If Assumption 2 fails then OLS can only be interpreted as an optimal linear predic- 
tor; see Section 4.2.3. 


Stochastic Regressors 


Assumption 3 permits regressors to be stochastic regressors, as is usually the case 
when survey data rather than experimental data are used. It is assumed that in the limit 
the sample second-moment matrix is constant and nonsingular. 

If the regressors are iid, as is assumed under simple random sampling, then 
$\mathbf{M}_{xx} = \mathrm{E}[\mathbf{x}\mathbf{x}']$ and Assumption 3 can be reduced to an assumption that the second
moment exists. If the regressors are stochastic but inid, as is the case for stratified
random sampling, then we need the stronger Assumption 3, which permits application
of the Markov LLN to obtain $\operatorname{plim} N^{-1}\mathbf{X}'\mathbf{X}$. If the regressors are fixed in repeated
samples, the common less-satisfactory assumption made in introductory courses, then
$\mathbf{M}_{xx} = \lim N^{-1}\mathbf{X}'\mathbf{X}$ and Assumption 3 becomes the assumption that this limit exists.


Weakly Exogenous Regressors 


Assumption 4 of zero conditional mean errors is crucial because when combined
with Assumption 2 it implies that $\mathrm{E}[\mathbf{y}|\mathbf{X}] = \mathbf{X}\boldsymbol{\beta}$, so that the conditional mean is indeed
linear in the regressors.

The assumption that E[u|x] = 0 implies that Cov[x,u] = 0, so that the error is
uncorrelated with regressors. This follows as Cov[x,u] = E[xu] - E[x]E[u] and E[u|x] =
0 implies E[xu] = 0 and E[u] = 0 by the law of iterated expectations. The weaker
assumption that Cov[x,u] = 0 can be sufficient for consistency of OLS, whereas the
stronger assumption that E[u|x] = 0 is needed for unbiasedness of OLS.

The economic meaning of Assumption 4 is that the error term represents all the 
excluded factors that are assumed to be uncorrelated with X and these have, on av- 
erage, zero impact on y. This is a key assumption that was referred to in Section 2.3 
as the weak exogeneity assumption. Essentially this means that the knowledge of the 
data-generating process for X variables does not contribute useful information for
estimating $\boldsymbol{\beta}$. When the assumption fails, one or more of the K regressor variables is
said to be jointly dependent with y, or simply endogenous. A general term for cor- 
relation of regressors with errors is endogeneity or endogenous regressors, where 
the term “endogenous” means caused by factors inside the system. As we will show 
in Section 4.7, the violation of weak exogeneity may lead to inconsistent estimation. 
There are many ways in which weak exogeneity can be violated, but one of the most 
common involves a variable in x that is a choice or a decision variable that is related 
to y in a larger model. Ignoring these other relationships, and treating $\mathbf{x}_i$ as if it were
randomly assigned to observation i, and hence uncorrelated with $u_i$, will have nontrivial
consequences. Endogenous sampling is ruled out by Assumption 4. Instead,
if data are collected by stratified random sampling it must be exogenous stratified 
sampling. 
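
The consequence of a regressor that is correlated with the error can be sketched by simulation. The dgp below is hypothetical: the regressor and error are constructed with correlation `rho`, with unit variances, so that the probability limit of the OLS slope is the true slope plus $\mathrm{Cov}[x,u]/\mathrm{V}[x] = \rho$. The function name and parameter values are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(7)

def ols_slope(N, rho):
    # rho = correlation between regressor x and error u (hypothetical dgp).
    v = rng.normal(size=N)
    u = rng.normal(size=N)
    x = rho * u + np.sqrt(1.0 - rho**2) * v    # Cov[x, u] = rho, V[x] = 1
    y = 1.0 + 2.0 * x + u                      # true slope is 2
    X = np.column_stack([np.ones(N), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

slope_exog = ols_slope(200_000, rho=0.0)   # weak exogeneity holds
slope_endog = ols_slope(200_000, rho=0.5)  # endogenous regressor
print(slope_exog, slope_endog)
```

Even with a very large sample the second estimate stays near 2.5 rather than 2, which is the sense in which OLS is inconsistent when weak exogeneity fails.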


Conditionally Heteroskedastic Errors 


Independent regression errors uncorrelated with regressors are assumed, a consequence
of Assumptions 1, 2, and 4. Introductory courses usually further restrict attention
to errors that are homoskedastic with homogeneous or constant variances, in
which case $\sigma_i^2 = \sigma^2$ for all i. Then the errors are iid $(0, \sigma^2)$ and are called spherical
errors since $\boldsymbol{\Omega} = \sigma^2\mathbf{I}$.

Assumption 5 is instead one of conditionally heteroskedastic regression errors, 
where heteroskedastic means heterogeneous variances or different variances. The
assumption is stated in terms of the second moment $\mathrm{E}[u^2|\mathbf{x}]$, but this equals the
variance $\mathrm{V}[u|\mathbf{x}]$ since $\mathrm{E}[u|\mathbf{x}] = 0$ by Assumption 4. This more general assumption of
heteroskedastic errors is made because empirically this is often the case for cross-section
regression. Furthermore, relaxing the homoskedasticity assumption is not costly as it 
is possible to obtain valid standard errors for the OLS estimator even if the functional 
form for the heteroskedasticity is unknown. 

The term conditionally heteroskedastic is used for the following reason. Even if 
$(y_i, \mathbf{x}_i)$ are iid, as is the case for simple random sampling, once we condition on $\mathbf{x}_i$
the conditional mean and conditional variance can vary with $\mathbf{x}_i$. Similarly, the errors
$u_i = y_i - \mathbf{x}_i'\boldsymbol{\beta}$ are iid under simple random sampling, and they are therefore
unconditionally homoskedastic. Once we condition on $\mathbf{x}_i$, and consider the distribution of
$u_i$ conditional on $\mathbf{x}_i$, the variance of this conditional distribution is permitted to vary
with $\mathbf{x}_i$.


Limit Variance Matrix of $N^{-1/2}\mathbf{X}'\mathbf{u}$


Assumption 6 is needed to obtain the limit variance matrix of $N^{-1/2}\mathbf{X}'\mathbf{u}$. If regressors
are independent of the errors, a stronger assumption than that made in Assumption
4, then Assumption 5 that $\mathrm{E}[|u_i^2|^{1+\delta}] < \infty$ and Assumption 3 that $\mathrm{E}[|x_{ij}x_{ik}|^{1+\delta}] < \infty$
imply the Assumption 6 condition that $\mathrm{E}[|u_i^2 x_{ij}x_{ik}|^{1+\delta}] < \infty$.

We have deliberately not made a seventh assumption, that the error u is normally 
distributed conditional on X. An assumption such as normality is needed to obtain the 
exact small-sample distribution of the OLS estimator. However, we focus on asymp- 
totic methods throughout this book, because exact small-sample distributional results 
are rarely available for the estimators used in microeconometrics, and then the normal- 
ity assumption is no longer needed. 


4.4.8. Derivations for the OLS Estimator 


Here we present both small-sample and limit distributions of the OLS estimator and 
justify White’s estimator of the variance matrix of the OLS estimator under Assump- 
tions 1-6. 


Small-Sample Distribution 


The parameter $\boldsymbol{\beta}$ is identified under Assumptions 1-4 since then $\mathrm{E}[\mathbf{y}|\mathbf{X}] = \mathbf{X}\boldsymbol{\beta}$ and $\mathbf{X}$
has rank K.

In small samples the OLS estimator is unbiased under Assumptions 1-4 and its variance
matrix is easily obtained given Assumption 5. These results are obtained by using
the law of iterated expectations to first take expectation with respect to u conditional 
on X and then take the unconditional expectation. Then from (4.11) 


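$\mathrm{E}[\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}] = \boldsymbol{\beta} + \mathrm{E}_{\mathbf{X},\mathbf{u}}[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}]$ (4.24)
$= \boldsymbol{\beta} + \mathrm{E}_{\mathbf{X}}[\mathrm{E}_{\mathbf{u}|\mathbf{X}}[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}|\mathbf{X}]]$
$= \boldsymbol{\beta} + \mathrm{E}_{\mathbf{X}}[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathrm{E}_{\mathbf{u}|\mathbf{X}}[\mathbf{u}|\mathbf{X}]]$
$= \boldsymbol{\beta}$,


using the law of iterated expectations (Theorem A.23) and given Assumptions 1 and
4, which together imply that $\mathrm{E}[\mathbf{u}|\mathbf{X}] = \mathbf{0}$. Similarly, (4.11) yields


$\mathrm{V}[\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}] = \mathrm{E}_{\mathbf{X}}[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Omega}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}]$, (4.25)


given Assumption 5, where $\mathrm{E}[\mathbf{u}\mathbf{u}'|\mathbf{X}] = \boldsymbol{\Omega}$ and we use Theorem A.23, which tells us
that in general


$\mathrm{V}_{\mathbf{X},\mathbf{u}}[\mathbf{g}(\mathbf{X}, \mathbf{u})] = \mathrm{E}_{\mathbf{X}}[\mathrm{V}_{\mathbf{u}|\mathbf{X}}[\mathbf{g}(\mathbf{X}, \mathbf{u})]] + \mathrm{V}_{\mathbf{X}}[\mathrm{E}_{\mathbf{u}|\mathbf{X}}[\mathbf{g}(\mathbf{X}, \mathbf{u})]]$.


This simplifies here as the second term is zero since $\mathrm{E}_{\mathbf{u}|\mathbf{X}}[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}] = \mathbf{0}$.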

The OLS estimator is therefore unbiased if E[u|X] = 0. This valuable property 
generally does not extend to nonlinear estimators. Most nonlinear estimators, such 
as nonlinear least squares, are biased and even linear estimators such as instrumental
variables estimators can be biased. The OLS estimator is inefficient, as its variance
is not the smallest possible variance matrix among linear unbiased estimators, unless
$\boldsymbol{\Omega} = \sigma^2\mathbf{I}$. The inefficiency of OLS provides motivation for more efficient estimators
such as generalized least squares, though the efficiency loss of OLS is not necessarily 
great. Under the additional assumption of normality of the errors conditional on X, an 
assumption not usually made in microeconometrics applications, the OLS estimator is 
normally distributed conditional on X. 


Consistency 


The term $\operatorname{plim}\,(N^{-1}X'X)^{-1} = M_{xx}^{-1}$ since $\operatorname{plim} N^{-1}X'X = M_{xx}$ by Assumption 3. 


Consistency then requires that condition (4.13) holds. This is established using a law 
of large numbers applied to the average $N^{-1}X'u = N^{-1}\sum_i x_i u_i$, which converges in 
probability to zero if $E[x_i u_i] = 0$. Given Assumptions 1 and 2, the $x_i u_i$ are inid and 
Assumptions 1-5 permit use of the Markov LLN (Theorem A.9). If Assumption 1 is 
simplified to $(y_i, x_i)$ iid then $x_i u_i$ are iid and Assumptions 1-4 permit simpler use of 
the Kolmogorov LLN (Theorem A.8). 


Limit Distribution 


By Assumption 3, $\operatorname{plim}\,(N^{-1}X'X)^{-1} = M_{xx}^{-1}$. The key is to obtain the limit distribution 
of $N^{-1/2}X'u = N^{-1/2}\sum_i x_i u_i$ by application of a central limit theorem. Given 
Assumptions 1 and 2, the $x_i u_i$ are inid and Assumptions 1-6 permit use of the Liapounov 
CLT (Theorem A.15). If Assumption 1 is strengthened to $(y_i, x_i)$ iid then $x_i u_i$ 
are iid and Assumptions 1-5 permit simpler use of the Lindeberg-Levy CLT (Theorem 
A.14). 

This yields 


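$$\frac{1}{\sqrt{N}}X'u \xrightarrow{d} \mathcal{N}[0, M_{x\Omega x}], \tag{4.26}$$

where $M_{x\Omega x} = \operatorname{plim} N^{-1}X'uu'X = \operatorname{plim} N^{-1}\sum_i u_i^2 x_i x_i'$ given independence over i. 
Application of a law of large numbers yields $M_{x\Omega x} = \lim N^{-1}\sum_i E_{x_i}[\sigma_i^2 x_i x_i']$, using 
$E_{u_i,x_i}[u_i^2 x_i x_i'] = E_{x_i}[E[u_i^2|x_i]\,x_i x_i']$ and $\sigma_i^2 = E[u_i^2|x_i]$. It follows that 
$M_{x\Omega x} = \lim N^{-1}E[X'\Omega X]$, where $\Omega = \operatorname{Diag}[\sigma_i^2]$ and the expectation is with respect to only 
X, rather than both X and u. 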

The presentation here assumes independence over i. More generally we can permit 
correlated observations. Then $M_{x\Omega x} = \operatorname{plim} N^{-1}\sum_i\sum_j u_i u_j x_i x_j'$ and $\Omega$ has $ij$th 
entry $\sigma_{ij} = \operatorname{Cov}[u_i, u_j]$. This complication is deferred to treatment of the nonlinear LS 
estimator in Section 5.8. 


Heteroskedasticity-Robust Standard Errors 

We consider the key step of consistent estimation of $M_{x\Omega x}$. Beginning with the original 
definition $M_{x\Omega x} = \operatorname{plim} N^{-1}\sum_i u_i^2 x_i x_i'$, we replace $u_i$ by $\hat{u}_i = y_i - x_i'\hat{\beta}$, where 
asymptotically $\hat{u}_i \xrightarrow{p} u_i$ since $\hat{\beta} \xrightarrow{p} \beta$. This yields the consistent estimate 

$$\hat{M}_{x\Omega x} = N^{-1}X'\hat{\Omega}X, \tag{4.27}$$

where $\hat{\Omega} = \operatorname{Diag}[\hat{u}_i^2]$. The additional assumption that $E[|x_{ij}^2 x_{ik} x_{il}|^{1+\delta}] < \Delta$ for positive 
constants $\delta$ and $\Delta$ and $j, k, l = 1, \ldots, K$ is needed, as $\hat{u}_i^2 x_i x_i' = (u_i - x_i'(\hat{\beta} - \beta))^2 x_i x_i'$ 
involves up to the fourth power of $x_i$ (see White (1980a)). 

Note that $\hat{\Omega}$ does not converge to the $N \times N$ matrix $\Omega$, a seemingly impossible 
task without additional structure as there are N variances $\sigma_i^2$ to be estimated. 
But all that is needed is that $N^{-1}X'\hat{\Omega}X$ converges to the $K \times K$ matrix 
$\operatorname{plim} N^{-1}X'\Omega X = \operatorname{plim} N^{-1}\sum_i \sigma_i^2 x_i x_i'$. This is easier to achieve because the number 
of regressors K is fixed. To understand White's estimator, consider OLS estimation of 
the intercept-only model $y_i = \beta + u_i$ with heteroskedastic error. Then in our notation 
we can show that $\hat{\beta} = \bar{y}$, $M_{xx} = \lim N^{-1}\sum_i 1 = 1$, and $M_{x\Omega x} = \lim N^{-1}\sum_i E[u_i^2]$. 
An obvious estimator for $M_{x\Omega x}$ is $\hat{M}_{x\Omega x} = N^{-1}\sum_i \hat{u}_i^2$, where $\hat{u}_i = y_i - \hat{\beta}$. To obtain 
the probability limit of this estimate, it is enough to consider $N^{-1}\sum_i u_i^2$, since 
$\hat{u}_i - u_i \xrightarrow{p} 0$ given $\hat{\beta} \xrightarrow{p} \beta$. If a law of large numbers can be applied this average converges 
to the limit of its expected value, so $\operatorname{plim} N^{-1}\sum_i u_i^2 = \lim N^{-1}\sum_i E[u_i^2] = M_{x\Omega x}$ as 
desired. Eicker (1967) gave the formal conditions for this example. 
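
As a minimal computational sketch of White's estimator, the Python code below forms the sandwich $(X'X)^{-1}X'\hat{\Omega}X(X'X)^{-1}$ with $\hat{\Omega}$ equal to a diagonal matrix of squared OLS residuals, and applies it to an intercept-only model. The function name, the use of NumPy, and the four data values are illustrative assumptions rather than anything taken from the text, and no degrees-of-freedom correction is applied to the robust estimate.

```python
import numpy as np

def ols_with_white_variance(X, y):
    """Return OLS coefficients, the default variance, and White's robust variance."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta_hat = xtx_inv @ X.T @ y
    resid = y - X @ beta_hat
    # Default estimate s^2 (X'X)^{-1}, appropriate only under homoskedasticity.
    s2 = resid @ resid / (n - k)
    v_default = s2 * xtx_inv
    # X' Diag[u_hat_i^2] X = sum_i u_hat_i^2 x_i x_i', formed without building the N x N matrix.
    meat = (X * resid[:, None] ** 2).T @ X
    v_robust = xtx_inv @ meat @ xtx_inv
    return beta_hat, v_default, v_robust

# Intercept-only model y_i = beta + u_i with made-up data.
y = np.array([1.0, 3.0, 2.0, 6.0])
X = np.ones((4, 1))
beta_hat, v_default, v_robust = ols_with_white_variance(X, y)
print(beta_hat[0], v_robust[0, 0])  # sample mean 3.0 and robust variance 0.875
```

In the intercept-only case the robust variance reduces to $N^{-2}\sum_i \hat{u}_i^2$, which is $N^{-1}$ times the estimate of $M_{x\Omega x}$ discussed above.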


4.5. Weighted Least Squares 


If robust standard errors need to be used efficiency gains are usually possible. For 
example, if heteroskedasticity is present then the feasible generalized least-squares 
(GLS) estimator is more efficient than the OLS estimator. 

In this section we present the feasible GLS estimator, an estimator that makes 
stronger distributional assumptions about the variance of the error term. It is nonethe- 
less possible to obtain standard errors of the feasible GLS estimator that are robust to 
misspecification of the error variance, just as in the OLS case. 

Many studies in microeconometrics do not take advantage of the potential efficiency 
gains of GLS, for reasons of convenience and because the efficiency gains may be felt 
to be relatively small. Instead, it is common to use less efficient weighted least-squares 
estimators, most notably OLS, with robust estimates of the standard errors. 


4.5.1. GLS and Feasible GLS 


By the Gauss—Markov theorem, presented in introductory texts, the OLS estimator is 
efficient among linear unbiased estimators if the linear regression model errors are 
independent and homoskedastic. 

Instead, we assume that the error variance matrix $\Omega \neq \sigma^2 I$. If $\Omega$ is known and 
nonsingular, we can premultiply the linear regression model (4.8) by $\Omega^{-1/2}$, where 
$\Omega^{1/2}\Omega^{1/2} = \Omega$, to yield 

$$\Omega^{-1/2}y = \Omega^{-1/2}X\beta + \Omega^{-1/2}u.$$


Some algebra yields $V[\Omega^{-1/2}u] = E[(\Omega^{-1/2}u)(\Omega^{-1/2}u)'|X] = I$. The errors in this 
transformed model are therefore zero mean, uncorrelated, and homoskedastic. So $\beta$ 
can be efficiently estimated by OLS regression of $\Omega^{-1/2}y$ on $\Omega^{-1/2}X$. 

This argument yields the generalized least-squares estimator 


$$\hat{\beta}_{GLS} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y. \tag{4.28}$$
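
The equivalence between the GLS formula and OLS on the transformed model can be checked numerically. The sketch below assumes a known, nonsingular error variance matrix and uses a symmetric square root obtained from an eigendecomposition; the function names, the simulated data, and the particular variance matrix are hypothetical choices made only for illustration.

```python
import numpy as np

def gls_direct(X, y, omega):
    """GLS computed as (X' Omega^{-1} X)^{-1} X' Omega^{-1} y."""
    omega_inv_X = np.linalg.solve(omega, X)          # Omega^{-1} X
    return np.linalg.solve(X.T @ omega_inv_X, omega_inv_X.T @ y)

def gls_by_transformation(X, y, omega):
    """OLS after premultiplying y and X by the symmetric matrix Omega^{-1/2}."""
    eigval, eigvec = np.linalg.eigh(omega)           # omega is symmetric positive definite
    omega_inv_half = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
    beta, *_ = np.linalg.lstsq(omega_inv_half @ X, omega_inv_half @ y, rcond=None)
    return beta

# Made-up example with N = 50 observations and K = 3 regressors.
rng = np.random.default_rng(0)
N = 50
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
A = rng.normal(size=(N, N))
omega = A @ A.T / N + np.eye(N)                      # an arbitrary positive definite matrix
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=N)

b_direct = gls_direct(X, y, omega)
b_transformed = gls_by_transformation(X, y, omega)
```

The two coefficient vectors agree up to rounding error. In practice the variance matrix is not known, which is the issue taken up next.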


The GLS estimator cannot be directly implemented because in practice $\Omega$ is not 
known. Instead, we specify that $\Omega = \Omega(\gamma)$, where $\gamma$ is a finite-dimensional parameter 
vector, obtain a consistent estimate $\hat{\gamma}$ of $\gamma$, and form $\hat{\Omega} = \Omega(\hat{\gamma})$. For example, if errors 
are heteroskedastic then specify $V[u|x] = \exp(z'\gamma)$, where z is a subset of x and the 
exponential function is used to ensure a positive variance. Then $\gamma$ can be consistently 
estimated by nonlinear least-squares regression (see Section 5.8) of the squared OLS 
residual $\hat{u}^2 = (y - x'\hat{\beta}_{OLS})^2$ on $\exp(z'\gamma)$. This estimate $\hat{\Omega}$ can be used in place of $\Omega$ 
in (4.28). Note that we cannot replace $\Omega$ in (4.28) by $\hat{\Omega} = \operatorname{Diag}[\hat{u}_i^2]$ as this yields an 
inconsistent estimator (see Section 5.8.6). 

The feasible generalized least-squares (FGLS) estimator is 


$$\hat{\beta}_{FGLS} = (X'\hat{\Omega}^{-1}X)^{-1}X'\hat{\Omega}^{-1}y. \tag{4.29}$$


If Assumptions 1-6 hold and $\Omega(\gamma)$ is correctly specified, a strong assumption that is 
relaxed in the following, and $\hat{\gamma}$ is consistent for $\gamma$, it can be shown that 

$$\sqrt{N}(\hat{\beta}_{FGLS} - \beta) \xrightarrow{d} \mathcal{N}\left[0, \left(\operatorname{plim} N^{-1}X'\Omega^{-1}X\right)^{-1}\right]. \tag{4.30}$$

The FGLS estimator has the same limiting variance matrix as the GLS estimator and 
so is second-moment efficient. For implementation replace $\Omega$ by $\hat{\Omega}$ in (4.30). 

It can be shown that the GLS estimator minimizes $u'\Omega^{-1}u$ (see Exercise 4.5), which 
simplifies to $\sum_i u_i^2/\sigma_i^2$ if errors are heteroskedastic but uncorrelated. The motivation 
provided for GLS was efficient estimation of $\beta$. In terms of the Section 4.2 discussion 
of loss function and optimal prediction, with heteroskedastic errors the loss function is 
$L(e) = e^2/\sigma^2$. Compared to OLS with $L(e) = e^2$, the GLS loss function places a 
relatively smaller penalty on the prediction error for observations with large conditional 
error variance. 


4.5.2. Weighted Least Squares 


The result in (4.30) assumes correct specification of the error variance matrix $\Omega(\gamma)$. 
If instead $\Omega(\gamma)$ is misspecified then the FGLS estimator is still consistent, but (4.30) 
gives the wrong variance. Fortunately, a robust estimate of the variance of the GLS 
estimator can be found even if $\Omega(\gamma)$ is misspecified. 

Table 4.2. Least-Squares Estimators and Their Asymptotic Variance 

Estimator^a   Definition                                                              Estimated Asymptotic Variance 
OLS           $\hat{\beta} = (X'X)^{-1}X'y$                                            $(X'X)^{-1}X'\hat{\Omega}X(X'X)^{-1}$ 
FGLS          $\hat{\beta} = (X'\hat{\Omega}^{-1}X)^{-1}X'\hat{\Omega}^{-1}y$          $(X'\hat{\Omega}^{-1}X)^{-1}$ 
WLS           $\hat{\beta} = (X'\hat{\Sigma}^{-1}X)^{-1}X'\hat{\Sigma}^{-1}y$          $(X'\hat{\Sigma}^{-1}X)^{-1}X'\hat{\Sigma}^{-1}\hat{\Omega}\hat{\Sigma}^{-1}X(X'\hat{\Sigma}^{-1}X)^{-1}$ 

a Estimators are for linear regression model with error conditional variance matrix $\Omega$. For FGLS it is 
assumed that $\hat{\Omega}$ is consistent for $\Omega$. For OLS and WLS the heteroskedastic robust variance matrix of $\hat{\beta}$ 
uses $\hat{\Omega}$ equal to a diagonal matrix with squared residuals on the diagonals. 

Specifically, define $\Sigma = \Sigma(\gamma)$ to be a working variance matrix that does not 
necessarily equal the true variance matrix $\Omega = E[uu'|X]$. Form an estimate $\hat{\Sigma} = \Sigma(\hat{\gamma})$, 
where $\hat{\gamma}$ is an estimate of $\gamma$. Then use weighted least squares with weighting 
matrix $\hat{\Sigma}^{-1}$. 
This yields the weighted least-squares (WLS) estimator 


$$\hat{\beta}_{WLS} = (X'\hat{\Sigma}^{-1}X)^{-1}X'\hat{\Sigma}^{-1}y. \tag{4.31}$$

Statistical inference is then done without the assumption that $\Sigma = \Omega$, the true variance 
matrix of the error term. In the statistics literature this approach is referred to as a 
working matrix approach. We call it weighted least squares, but be aware that others 
instead use weighted least squares to mean GLS or FGLS in the special case that $\Omega^{-1}$ 
is diagonal. Here there is no presumption that the weighting matrix $\Sigma^{-1} = \Omega^{-1}$. 
Similar algebra to that for OLS given in Section 4.4.5 yields the estimated asymp- 
totic variance matrix 


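$$\hat{V}[\hat{\beta}_{WLS}] = (X'\hat{\Sigma}^{-1}X)^{-1}X'\hat{\Sigma}^{-1}\hat{\Omega}\hat{\Sigma}^{-1}X(X'\hat{\Sigma}^{-1}X)^{-1}, \tag{4.32}$$

where $\hat{\Omega}$ is such that 

$$\operatorname{plim} N^{-1}X'\hat{\Sigma}^{-1}\hat{\Omega}\hat{\Sigma}^{-1}X = \operatorname{plim} N^{-1}X'\Sigma^{-1}\Omega\Sigma^{-1}X.$$

In the heteroskedastic case $\hat{\Omega} = \operatorname{Diag}[\hat{u}_i^{*2}]$, where $\hat{u}_i^* = y_i - x_i'\hat{\beta}_{WLS}$. 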

For heteroskedastic errors the basic approach is to choose a simple model for het- 
eroskedasticity such as error variance depending on only one or two key regressors. For 
example, in a linear regression model of the level of wages as a function of schooling 
and other variables, the heteroskedasticity might be modeled as a function of schooling 
alone. Suppose this model yields $\hat{\Sigma} = \operatorname{Diag}[\hat{\sigma}_i^2]$. Then OLS regression of $y_i/\hat{\sigma}_i$ on 
$x_i/\hat{\sigma}_i$ (with the no-constant option) yields $\hat{\beta}_{WLS}$ and the White robust standard errors 
from this regression can be shown to equal those based on (4.32). 
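
The claimed equality between (4.32) and White standard errors from the transformed regression can be verified directly. The following sketch assumes a diagonal working variance matrix; the function name, the simulated wage and schooling data, and the working model (error standard deviation proportional to schooling) are hypothetical, and the working model is deliberately different from the variance used to generate the data.

```python
import numpy as np

def wls_robust(X, y, sigma2_work):
    """WLS with diagonal working variances and the robust variance in (4.32)."""
    w = 1.0 / sigma2_work                                  # diagonal of Sigma_hat^{-1}
    bread = np.linalg.inv(X.T @ (X * w[:, None]))          # (X' Sigma^{-1} X)^{-1}
    beta = bread @ (X.T @ (w * y))
    resid = y - X @ beta                                   # u*_i = y_i - x_i' beta_WLS
    # X' Sigma^{-1} Diag[u*_i^2] Sigma^{-1} X
    meat = X.T @ (X * (w ** 2 * resid ** 2)[:, None])
    return beta, bread @ meat @ bread

# Made-up data: schooling shifts both the mean and the error variance of wages.
rng = np.random.default_rng(42)
n = 400
school = rng.integers(8, 21, size=n).astype(float)
X = np.column_stack([np.ones(n), school])
wage = 2.0 + 1.5 * school + rng.normal(size=n) * 0.3 * school ** 1.5
sigma2_work = school ** 2                                  # working variance model

beta_wls, v_wls = wls_robust(X, wage, sigma2_work)

# Same quantities from OLS on data divided by sigma_hat_i (no added constant),
# followed by White's formula applied to the transformed regression.
s = np.sqrt(sigma2_work)
Xs, ys = X / s[:, None], wage / s
b_ols = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
e = ys - Xs @ b_ols
bread_s = np.linalg.inv(Xs.T @ Xs)
v_white = bread_s @ (Xs.T @ (Xs * (e ** 2)[:, None])) @ bread_s
```

Both routes give the same coefficients and the same variance matrix up to rounding error, because the transformed residual equals $\hat{u}_i^*/\hat{\sigma}_i$.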

The weighted least-squares or working matrix approach is especially convenient 
when there is more than one complication. For example, in the random effects panel 
data model of Chapter 21 the errors may be viewed as both correlated over time for a 
given individual and heteroskedastic. One may use the random effects estimator, which 
controls only for the first complication, but then compute heteroskedastic-consistent 
standard errors for this estimator. 

The various least-squares estimators are summarized in Table 4.2. 
Table 4.3. Least Squares: Example with Conditionally Heteroskedastic Errors^a 

              OLS        WLS        GLS 
Constant      2.213      1.060      0.996 
             (0.823)    (0.150)    (0.007) 
             [0.820]    [0.051]    [0.006] 
x             0.979      0.957      0.952 
             (0.178)    (0.190)    (0.209) 
             [0.275]    [0.232]    [0.208] 
R^2           0.236      0.205      0.174 

a Generated data for sample size of 100. OLS, WLS, and GLS 
are all consistent but OLS and WLS are inefficient. Two different 
standard errors are given: default standard errors assuming 
homoskedastic errors in parentheses and heteroskedastic-robust 
standard errors in square brackets. The data-generating process 
is given in the text. 


4.5.3. Robust Standard Errors for LS Example 


As an example of robust standard error estimation, consider estimation of the standard 
error of least-squares estimates of the slope coefficient for a dgp with multiplicative 
heteroskedasticity 


$$y = 1 + 1 \times x + u,$$

$$u = x\varepsilon,$$

where the scalar regressor $x \sim \mathcal{N}[0, 25]$ and $\varepsilon \sim \mathcal{N}[0, 4]$. 

The errors are conditionally heteroskedastic, since $V[u|x] = V[x\varepsilon|x] = 
x^2V[\varepsilon|x] = 4x^2$, which depends on the regressor x. This differs from the unconditional 
variance, where $V[u] = V[x\varepsilon] = E[(x\varepsilon)^2] - (E[x\varepsilon])^2 = E[x^2]E[\varepsilon^2] = V[x]V[\varepsilon] = 
100$, given x and $\varepsilon$ independent and the particular dgp here. 

Standard errors for the OLS estimator should be calculated using the 
heteroskedastic-consistent or robust variance estimate (4.21). Since OLS is not fully 
efficient, WLS may provide efficiency gains. GLS will definitely provide efficiency 
gains and in this simulated data example we have the advantage of knowing that 
$V[u|x] = 4x^2$. All estimation methods yield a consistent estimate of the intercept and 
slope coefficients. 

Various least-squares estimates and associated standard errors from a generated data 
sample of size 100 are given in Table 4.3. We focus on the slope coefficient. 

The OLS slope coefficient estimate is 0.979. Two standard error estimates are re- 
ported, with the correct heteroskedasticity-robust standard error of 0.275 using (4.21) 
much larger here than the incorrect estimate of 0.177 that uses $s^2(X'X)^{-1}$. Such a large 
difference in standard error estimates could lead to quite different conclusions in 
statistical inference. In general the bias in the default standard errors could be in either 
direction. For this example it can be shown theoretically that, in the limit, the robust 
standard errors are $\sqrt{3}$ times larger than the incorrect ones. Specifically, for this dgp 
and for sample size N the correct and incorrect standard errors of the OLS estimate of 
the slope coefficient converge to, respectively, $\sqrt{12/N}$ and $\sqrt{4/N}$. 

As an example of the WLS estimator, assume that $u = \sqrt{|x|}\,\varepsilon$ rather than $u = x\varepsilon$, 
so that it is assumed that $V[u] = \sigma^2|x|$. The WLS estimator can be computed by OLS 
regression after dividing y, the intercept, and x by $\sqrt{|x|}$. Since this is the wrong model 
for the heteroskedastic error the correct standard error for the slope coefficient is the 
robust estimate of 0.232, computed using (4.32). 

The GLS estimator for this dgp can be computed by OLS regression after dividing 
y, the intercept, and x by |x|, since the transformed error is then homoskedastic. The 
usual and robust standard errors for the slope coefficient are similar (0.209 and 0.208). 
This is expected as both are asymptotically correct because the GLS estimator here 
uses the correct model for heteroskedasticity. It can be shown theoretically that for this 
dgp the standard error of the GLS estimate of the slope coefficient converges to $\sqrt{4/N}$. 

Both OLS and WLS are less efficient than GLS, as expected, with standard errors 
for the slope coefficient of, respectively, 0.275 > 0.232 > 0.208. 
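
The limiting values $\sqrt{12/N}$ and $\sqrt{4/N}$ for the OLS slope can be checked by simulating the dgp of this example with a much larger sample than the 100 observations used for Table 4.3. The sketch below is only illustrative: the seed, the sample size of 200,000, and the variable names are our own assumptions, and the numbers it produces are not those reported in the table.

```python
import numpy as np

rng = np.random.default_rng(12345)
N = 200_000
x = rng.normal(0.0, 5.0, N)            # V[x] = 25
eps = rng.normal(0.0, 2.0, N)          # V[eps] = 4
y = 1.0 + 1.0 * x + x * eps            # u = x * eps, so V[u|x] = 4 x^2

X = np.column_stack([np.ones(N), x])
xtx_inv = np.linalg.inv(X.T @ X)
beta_ols = xtx_inv @ X.T @ y
resid = y - X @ beta_ols

# Default standard error of the slope, based on s^2 (X'X)^{-1}.
s2 = resid @ resid / (N - 2)
se_default = np.sqrt(s2 * xtx_inv[1, 1])

# Heteroskedasticity-robust standard error of the slope.
meat = (X * resid[:, None] ** 2).T @ X
se_robust = np.sqrt((xtx_inv @ meat @ xtx_inv)[1, 1])

ratio = se_robust / se_default          # should be close to sqrt(3)
scaled_default = se_default * np.sqrt(N)  # should be close to sqrt(4) = 2
scaled_robust = se_robust * np.sqrt(N)    # should be close to sqrt(12)
```

With a sample this large the ratio of robust to default standard errors is close to $\sqrt{3} \approx 1.73$, whereas in a sample of 100 it can differ noticeably, as in Table 4.3.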

The setup in this example is a standard one used in estimation theory for cross- 
section data. Both y and x are stochastic random variables. The pair (y;, x;) are inde- 
pendent over i and identically distributed, as is the case under random sampling. The 
conditional distribution of y;|x; differs over i, however, since the conditional mean and 
variance of y; depend on x;. 


4.6. Median and Quantile Regression 


In an intercept-only model, summary statistics for the sample distribution include 
quantiles, such as the median, lower and upper quartiles, and percentiles, in addition 
to the sample mean. 

In the regression context we might similarly be interested in conditional quantiles. 
For example, interest may lie in how the percentiles of the earnings distribution for 
lowly educated workers are much more compressed than those for highly educated 
workers. In this simple example one can just do separate computations for lowly ed- 
ucated workers and for highly educated workers. However, this approach becomes 
infeasible if there are several regressors taking several values. Instead, quantile regres- 
sion methods are needed to estimate the quantiles of the conditional distribution of y 
given x. 

From Table 4.1, quantile regression corresponds to use of asymmetric absolute loss, 
whereas the special case of median regression uses absolute error loss. These methods 
provide an alternative to OLS, which uses squared error loss. 

Quantile regression methods have advantages beyond providing a richer charac- 
terization of the data. Median regression is more robust to outliers than least-squares 
regression. Moreover, quantile regression estimators can be consistent under weaker 
stochastic assumptions than possible with least-squares estimation. Leading examples 
are the maximum score estimator of Manski (1975) for binary outcome models (see 
Section 14.6) and the censored least absolute deviations estimator of Powell (1984) for 
censored models (see Section 16.6). 
We begin with a brief explanation of population quantiles before turning to estimation 
of sample quantiles. 


4.6.1. Population Quantiles 


For a continuous random variable y, the population qth quantile is that value $\mu_q$ such 
that y is less than or equal to $\mu_q$ with probability q. Thus 

$$q = \Pr[y \le \mu_q] = F_y(\mu_q),$$

where $F_y$ is the cumulative distribution function (cdf) of y. For example, if $\mu_{0.75} = 3$ 
then the probability that $y \le 3$ equals 0.75. It follows that 

$$\mu_q = F_y^{-1}(q).$$

Leading examples are the median, q = 0.5, the upper quartile, q = 0.75, and the lower 
quartile, q = 0.25. For the standard normal distribution $\mu_{0.5} = 0.0$, $\mu_{0.95} = 1.645$, and 
$\mu_{0.975} = 1.960$. The 100qth percentile is the qth quantile. 
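
The standard normal quantiles quoted above are values of the inverse cdf, which can be evaluated with the Python standard library. This is a small sketch; the variable names are our own.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# mu_q = F^{-1}(q) for the quantiles quoted in the text.
quantiles = {q: std_normal.inv_cdf(q) for q in (0.5, 0.95, 0.975)}

# Applying the cdf to mu_q recovers q, since q = F(mu_q).
recovered = {q: std_normal.cdf(mu_q) for q, mu_q in quantiles.items()}

for q, mu_q in quantiles.items():
    print(f"q = {q}: mu_q = {mu_q:.3f}")
```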

For the regression model, the population qth quantile of y conditional on x is 
that function $\mu_q(x)$ such that y conditional on x is less than or equal to $\mu_q(x)$ with 
probability q, where the probability is evaluated using the conditional distribution of 
y given x. It follows that 

$$\mu_q(x) = F_{y|x}^{-1}(q), \tag{4.33}$$

where $F_{y|x}$ is the conditional cdf of y given x and we have suppressed the role of the 
parameters of this distribution. 

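It is insightful to derive the quantile function $\mu_q(x)$ if the dgp is assumed to be the 
linear model with multiplicative heteroskedasticity 

$$y = x'\beta + u, \qquad u = x'\alpha \times \varepsilon, \qquad \varepsilon \sim \text{iid}\,[0, \sigma^2],$$

where it is assumed that $x'\alpha > 0$. Then the population qth quantile of y conditional 
on x is that function $\mu_q(x, \beta, \alpha)$ such that 

$$\begin{aligned}
q &= \Pr[y \le \mu_q(x, \beta, \alpha)] \\
&= \Pr[u \le \mu_q(x, \beta, \alpha) - x'\beta] \\
&= \Pr[\varepsilon \le [\mu_q(x, \beta, \alpha) - x'\beta]/x'\alpha] \\
&= F_\varepsilon([\mu_q(x, \beta, \alpha) - x'\beta]/x'\alpha),
\end{aligned}$$

where we use $u = y - x'\beta$ and $\varepsilon = u/x'\alpha$, and $F_\varepsilon$ is the cdf of $\varepsilon$. It follows that 
$[\mu_q(x, \beta, \alpha) - x'\beta]/x'\alpha = F_\varepsilon^{-1}(q)$ so that 

$$\begin{aligned}
\mu_q(x, \beta, \alpha) &= x'\beta + x'\alpha \times F_\varepsilon^{-1}(q) \\
&= x'(\beta + \alpha \times F_\varepsilon^{-1}(q)).
\end{aligned}$$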
Thus for the linear model with multiplicative heteroskedasticity of the form $u = x'\alpha \times \varepsilon$ 
the conditional quantiles are linear in x. In the special case of homoskedasticity, $x'\alpha$ 
equals a constant and all conditional quantiles have the same slope and differ only in 
their intercept, which becomes larger as q increases. 
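
As a worked illustration of this result, suppose in addition that the errors are normal, an assumption made here only for the example and not needed for the derivation. The coefficient values below are likewise hypothetical, with a scalar regressor restricted to $x > -2$ so that $x'\alpha > 0$.

```latex
% Illustrative assumption: \varepsilon \sim \mathcal{N}[0, \sigma^2], so that
% F_\varepsilon^{-1}(q) = \sigma z_q with z_q = \Phi^{-1}(q).
\mu_q(x, \beta, \alpha) = x'(\beta + \alpha \sigma z_q).

% Hypothetical values: x'\beta = 1 + 2x, \; x'\alpha = 1 + 0.5x, \; \sigma = 1.
% Median (z_{0.5} = 0) and 90th percentile (z_{0.9} \approx 1.2816):
\mu_{0.5}(x) = 1 + 2x,
\qquad
\mu_{0.9}(x) = (1 + 1.2816) + (2 + 0.5 \times 1.2816)\,x \approx 2.2816 + 2.6408\,x .
```

Both quantile functions are linear in x, and the slope rises with q because the error scale $1 + 0.5x$ increases with x.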

In more general examples the quantile function may be nonlinear in x, owing to 
other forms of heteroskedasticity such as $u = h(x, \alpha) \times \varepsilon$, where $h(\cdot)$ is nonlinear in 
x, or because the regression function itself is of nonlinear form $g(x, \beta)$. It is standard 
to still estimate quantile functions that are linear and interpret them as the best lin- 
ear predictor under the quantile regression loss function given in (4.34) in the next 
section. 


4.6.2. Sample Quantiles 


For univariate random variable y the usual way to obtain the sample quantile estimate 
is to first order the sample. Then $\hat{\mu}_q$ equals the $[Nq]$th smallest value, where N is the 
sample size and $[Nq]$ denotes $Nq$ rounded up to the nearest integer. For example, if 
N = 97, the lower quartile is the 25th observation since $[97 \times 0.25] = [24.25] = 25$. 
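
The ordering rule just described can be written as a short function. This is a minimal sketch with a hypothetical function name; the small rounding step guards against floating-point products such as 7.000000000000001 being rounded up to 8.

```python
import math

def sample_quantile(y, q):
    """Return the [Nq]-th smallest value of y, with Nq rounded up to an integer."""
    ordered = sorted(y)
    rank = math.ceil(round(len(ordered) * q, 9))   # 1-based rank of the order statistic
    return ordered[rank - 1]

# With N = 97 observations equal to 1, 2, ..., 97 the lower quartile is the
# 25th smallest value, since 97 * 0.25 = 24.25 rounds up to 25.
lower_quartile = sample_quantile(range(1, 98), 0.25)
```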

Koenker and Bassett (1978) observed that the sample qth quantile $\hat{\mu}_q$ can 
equivalently be expressed as the solution to the optimization problem of minimizing with 
respect to $\beta$ 

$$\sum_{i:\,y_i \ge \beta}^{N} q|y_i - \beta| + \sum_{i:\,y_i < \beta}^{N} (1 - q)|y_i - \beta|.$$


This result is not obvious. To gain some understanding, consider the median, where 
q = 0.5. Then the median is the minimizer of $\sum_i |y_i - \beta|$. Suppose in a sample 
of 99 observations that the 50th smallest observation, the median, equals 10 and 
the 51st smallest observation equals 12. If we let $\beta$ equal 12 rather than 10, then 
$\sum_i |y_i - \beta|$ will increase by 2 for the first 50 ordered observations and decrease by 
2 for the remaining 49 observations, leading to an overall net increase of $50 \times 2 - 
49 \times 2 = 2$. So the 51st smallest observation is a worse choice than the 50th. 
Similarly the 49th smallest observation can be shown to be a worse choice than the 50th 
observation. 
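
The argument for the median can be checked numerically. The sketch below evaluates the Koenker and Bassett objective on a made-up sample of 99 integers constructed so that the 50th smallest value is 10 and the 51st is 12; the function and variable names are our own.

```python
def check_loss(y, b, q):
    """Sum of q|y_i - b| over y_i >= b plus (1 - q)|y_i - b| over y_i < b."""
    return sum(q * (yi - b) if yi >= b else (1 - q) * (b - yi) for yi in y)

# 49 values below 10, then 10, then 49 values from 12 upward (99 in total).
y_sample = list(range(-39, 10)) + [10] + list(range(12, 61))

# For q = 0.5 the objective is half the sum of absolute deviations.
abs_dev_at_10 = 2 * check_loss(y_sample, 10, 0.5)
abs_dev_at_12 = 2 * check_loss(y_sample, 12, 0.5)
net_increase = abs_dev_at_12 - abs_dev_at_10        # 50 * 2 - 49 * 2 = 2

# Searching over the sample values recovers the order statistics.
best_median = min(y_sample, key=lambda b: check_loss(y_sample, b, 0.5))
best_lower_quartile = min(y_sample, key=lambda b: check_loss(y_sample, b, 0.25))
```

Here `best_median` is the 50th smallest value and `best_lower_quartile` is the 25th smallest value, matching the ordering rule with N = 99.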

This objective function is then readily expanded to the linear regression case, so 
that the qth quantile regression estimator $\hat{\beta}_q$ minimizes over $\beta_q$ 

$$Q_N(\beta_q) = \sum_{i:\,y_i \ge x_i'\beta_q}^{N} q|y_i - x_i'\beta_q| + \sum_{i:\,y_i < x_i'\beta_q}^{N} (1 - q)|y_i - x_i'\beta_q|, \tag{4.34}$$

where we use $\beta_q$ rather than $\beta$ to make clear that different choices of q estimate 
different values of $\beta$. Note that this is the asymmetric absolute loss function given in 
Table 4.1, where $\hat{y}$ is restricted to be linear in x so that $e = y - x'\beta_q$. The special case 
q = 0.5 is called the median regression estimator or the least absolute deviations 
estimator. 
4.6.3. Properties of Quantile Regression Estimators 


The objective function (4.34) is not differentiable and so the gradient optimization 
methods presented in Chapter 10 are not applicable. Fortunately, linear programming 
methods can be used and these provide relatively fast computation of $\hat{\beta}_q$. 
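
For intuition only, the sketch below minimizes (4.34) in a small sample by brute-force enumeration rather than by linear programming. It relies on the linear programming property that, when X has full column rank, some minimizer fits K observations exactly, so it is enough to compare all fits through K data points. This is not the method used in practice, it is far too slow for samples of realistic size, and the function names and data are hypothetical.

```python
import itertools
import numpy as np

def quantile_objective(beta, X, y, q):
    """Objective (4.34): q|e_i| for e_i >= 0 and (1 - q)|e_i| for e_i < 0."""
    e = y - X @ beta
    return float(np.sum(np.where(e >= 0, q * e, (q - 1) * e)))

def quantile_regression_bruteforce(X, y, q):
    """Minimize (4.34) by trying every fit through K observations (small N only)."""
    n, k = X.shape
    best_beta, best_val = None, np.inf
    for idx in itertools.combinations(range(n), k):
        Xs, ys = X[list(idx)], y[list(idx)]
        if abs(np.linalg.det(Xs)) < 1e-12:
            continue                      # these K points do not determine a unique fit
        beta = np.linalg.solve(Xs, ys)
        val = quantile_objective(beta, X, y, q)
        if val < best_val:
            best_beta, best_val = beta, val
    return best_beta, best_val

# Made-up data: nine points exactly on y = 1 + 2x and one large outlier.
x = np.arange(10.0)
X = np.column_stack([np.ones(10), x])
y = 1.0 + 2.0 * x
y[9] += 50.0

beta_median, loss_median = quantile_regression_bruteforce(X, y, 0.5)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```

The median regression fit stays on the line through the nine clean points, whereas the OLS fit is pulled toward the outlier, which illustrates the robustness of median regression noted in Section 4.6.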

Since there is no explicit solution for $\hat{\beta}_q$ the asymptotic distribution of $\hat{\beta}_q$ cannot 
be obtained using the approach of Section 4.4 for OLS. The methods of Chapter 5 also 
require adaptation, as the objective function is nondifferentiable. It can be shown that 

$$\sqrt{N}(\hat{\beta}_q - \beta_q) \xrightarrow{d} \mathcal{N}[0, A^{-1}BA^{-1}], \tag{4.35}$$

(see, for example, Buchinsky, 1998, p. 85), where 

$$A = \operatorname{plim} \frac{1}{N}\sum_{i=1}^{N} f_{u_q}(0|x_i)\,x_i x_i', \tag{4.36}$$

$$B = \operatorname{plim} \frac{1}{N}\sum_{i=1}^{N} q(1 - q)\,x_i x_i',$$

and $f_{u_q}(0|x)$ is the conditional density of the error term $u_q = y - x'\beta_q$ evaluated 
at $u_q = 0$. Estimation of the variance of $\hat{\beta}_q$ is complicated by the need to estimate 
$f_{u_q}(0|x)$. It is easier to instead obtain standard errors for $\hat{\beta}_q$ using the bootstrap pairs 
procedure of Chapter 11. 


4.6.4. Quantile Regression Example 


In this section we perform conditional quantile estimation and compare it with the 
usual conditional mean estimation using OLS regression. The application involves En- 
gel curve estimation for household annual medical expenditure. More specifically, we 
consider the regression relationship between the log of medical expenditure and the 
log of total household expenditure. This regression yields an estimate of the (constant) 
elasticity of medical expenditure with respect to total expenditure. 

The data are from the World Bank's 1997 Vietnam Living Standards Survey. The 
sample consists of 5,006 households that have a positive level of medical expenditure, 
after dropping 16.6% of the sample that has zero expenditures to permit taking the 
natural logarithm. Zero values can be handled using the censored quantile regression 
methods of Powell (1986a), presented in Section 16.9.2, but for simplicity we 
dropped observations with zero expenditures. The largest component of medical 
expenditure, especially at low levels of income, consists of medications purchased from 
pharmacies. Although several household characteristic variables are available, for sim- 
plicity we only consider one regressor, the log of total household expenditure, to serve 
as a proxy for household income. 

The linear least-squares regression yields an elasticity estimate of 0.57. This esti- 
mate would be usually interpreted to mean that medicines are a “necessity” and hence 
their demand is income inelastic. This estimate is not very surprising, but before ac- 
cepting it at face value we should acknowledge that there may be considerable hetero- 
geneity in the elasticity across different income groups. 
[Figure 4.1 omitted. Panel title: Slope Estimates as Quantile Varies. Vertical axis: slope and confidence bands. Horizontal axis: quantile. Series shown: quantile slope coefficient, upper and lower 95% confidence bands, and OLS slope coefficient.] 

Figure 4.1: Quantile regression estimates of slope coefficient for q = 0.05, 0.10, ..., 
0.90, 0.95 and associated 95% confidence bands plotted against q from regression of the 
natural logarithm of medical expenditure on the natural logarithm of total expenditure. 


Quantile regression is a useful tool for studying such heterogeneity, as emphasized 
by Koenker and Hallock (2001). We minimize the quantity (4.34), where y is log of 
medical expenditure and $x'\beta = \beta_1 + \beta_2 x$, where x is log of total household 
expenditure. This is done for the nineteen quantile values q = {0.05, 0.10, ..., 0.95}, where 
q = 0.5 is the median. In each case the standard errors were estimated using the boot- 
strap method with 50 resamples. The results of this exercise are condensed into Fig- 
ures 4.1 and 4.2. 

Figure 4.1 plots the slope coefficient $\hat{\beta}_{2,q}$ for the different values of q, along with 
the associated 95% confidence interval. This shows how the quantile estimates of the 
elasticity vary with quantile value q. The elasticity estimate increases systematically 
with the level of household income, rising from 0.15 for q = 0.05 to a maximum of 
0.80 for q = 0.85. The least-squares slope estimate of 0.57 is also presented as a hori- 
zontal line that does not vary with quantile. The elasticity estimates at lower and higher 
quantiles are clearly statistically significantly different from each other and from the 
OLS estimate, which has standard error 0.032. It seems that the aggregate elasticity es- 
timate will vary according to changes in the underlying income distribution. This graph 
supports the observation of Mosteller and Tukey (1977, p. 236), quoted by Koenker 
and Hallock (2001), that by focusing only on the conditional mean function the least- 
squares regression gives an incomplete summary of the joint distribution of dependent 
and explanatory variables. 

[Figure 4.2 omitted. Panel title: Regression Lines as Quantile Varies. Vertical axis: log household medical expenditure. Horizontal axis: log household total expenditure. Series shown: actual data, 90th percentile, median, and 10th percentile.] 

Figure 4.2: Quantile regression estimated lines for q = 0.1, q = 0.5, and q = 0.9 from 
regression of natural logarithm of medical expenditure on natural logarithm of total expenditure. 
Data for 5006 Vietnamese households with positive medical expenditures in 1997. 

Figure 4.2 superimposes on the data three estimated quantile regression lines 
$\hat{y}_q = \hat{\beta}_{1,q} + \hat{\beta}_{2,q}x$ for q = 0.1, 0.5, and 0.9. The OLS regression line, 
not graphed, is similar to the median (q = 0.5) regression line. There is a fanning out 
of the quantile regression lines in Figure 4.2. This is not surprising given the increase 
in estimated slopes as q increases as evident in Figure 4.1. Koenker and Bassett (1982) 
developed quantile regression as a means to test for heteroskedastic errors when the 
dgp is the linear model. For such a case a fanning out of the quantile regression lines 
is interpreted as evidence of heteroskedasticity. Another interpretation is that the con- 
ditional mean is nonlinear in x with increasing slope and this leads to quantile slope 
coefficients that increase with quantile q. 

More detailed illustrations of quantile regression are given in Buchinsky (1994) and 
Koenker and Hallock (2001). 


4.7. Model Misspecification 


The term “model misspecification” in its broadest sense means that one or more of the 
assumptions made on the data generating process are incorrect. Misspecifications may 
occur individually or in combination, but analysis is simpler if only the consequences 
of a single misspecification are considered. 

In the following discussion we emphasize misspecifications that lead to inconsis- 
tency of the least-squares estimator and loss of identifiability of parameters of inter- 
est. The least-squares estimator may nonetheless continue to have a meaningful inter- 
pretation, only one different from that intended under the assumption of a correctly 
specified model. Specifically, the estimator may converge asymptotically to a param- 
eter that differs from the true population value, a concept defined in Section 4.7.5 as 
the pseudo-true value. 

The issues raised here for consistency of OLS are relevant to other estimators in 
other models. Consistency can then require stronger assumptions than those needed 
for consistency of OLS, so that inconsistency resulting from model misspecification is 
more likely. 


4.7.1. Inconsistency of OLS 


The most serious consequence of a model misspecification is inconsistent estimation 
of the regression parameters $\beta$. From Section 4.4, the two key conditions needed to 
demonstrate consistency of the OLS estimator are (1) the dgp is $y = X\beta + u$ and (2) 
the dgp is such that $\operatorname{plim} N^{-1}X'u = 0$. Then 

$$\begin{aligned}
\hat{\beta}_{OLS} &= \beta + (N^{-1}X'X)^{-1}N^{-1}X'u \\
&\xrightarrow{p} \beta,
\end{aligned} \tag{4.37}$$

where the first equality follows if $y = X\beta + u$ (see (4.12)) and the second line uses 
$\operatorname{plim} N^{-1}X'u = 0$. 

The OLS estimator is likely to be inconsistent if model misspecification leads to 
either specification of the wrong model for y, so that condition 1 is violated, or 
correlation of regressors with the error, so that condition 2 is violated. 


4.7.2. Functional Form Misspecification 


A linear specification of the conditional mean function is merely an approximation in 
$R^K$ to the true unknown conditional mean function in parameter space of indeterminate 
dimension. Even if the correct regressors are chosen, it is possible that the conditional 
mean is incorrectly specified. 

Suppose the dgp is one with a nonlinear regression function 


$$y = g(x) + v,$$

where the dependence of $g(x)$ on unknown parameters is suppressed, and assume 
$E[v|x] = 0$. The linear regression model 

$$y = x'\beta + u$$


is erroneously specified. The question is whether the OLS estimator can be given any 
meaningful interpretation, even though the dgp is in fact nonlinear. 

The usual way to interpret regression coefficients is through the true micro relation- 
ship, which here is 


E[y;|x;] = g(x). 


In this case $\hat{\beta}_{OLS}$ does not measure the micro response of $E[y_i|x_i]$ to a change in $x_i$, as 
it does not converge to $\partial g(x_i)/\partial x_i$. So the usual interpretation of $\hat{\beta}_{OLS}$ is not possible. 

White (1980b) showed that the OLS estimator converges to that value of $\beta$ that 
minimizes the mean-squared prediction error 

$$E_x[(g(x) - x'\beta)^2].$$

Hence prediction from OLS is the best linear predictor of the nonlinear regression 
function if the mean-squared error is used as the loss function. This useful property 
has already been noted in Section 4.2.3, but it adds little in interpretation of $\hat{\beta}_{OLS}$. 

In summary, if the true regression function is nonlinear, OLS is not useful for indi- 
vidual prediction. OLS can still be useful for prediction of aggregate changes, giving 
the sample average change in E[y|x] due to change in x (see Stoker, 1982). However, 
microeconometric analyses usually seek models that are meaningful at the individual 
level. 

Much of this book presents alternatives to the linear model that are more likely to 
be correctly specified. For example, Chapter 14 on binary outcomes presents model 
specifications that ensure that predicted probabilities are restricted to lie between 0 
and 1. Also, models and methods that rely on minimal distributional assumptions are 
preferred because there is then less scope for misspecification. 


4.7.3. Endogeneity 


Endogeneity is formally defined in Section 2.3. A broad definition is that a regressor 
is endogenous when it is correlated with the error term. If any one regressor is en- 
dogenous then in general OLS estimates of all regression parameters are inconsistent 
(unless the exogenous regressor is uncorrelated with the endogenous regressor). 

Leading examples of endogeneity, dealt with extensively in this book in both linear 
and nonlinear model settings, include simultaneous equations bias (Section 2.4), omit- 
ted variable bias (Section 4.7.4), sample selection bias (Section 16.5), and measure- 
ment error bias (Chapter 26). Endogeneity is quite likely to occur when cross-section 
observational data are used, and economists are very concerned with this complication. 

A quite general approach to control for endogeneity is the instrumental variables 
method, presented in Sections 4.8 and 4.9 and in Sections 6.4 and 6.5. This method 
cannot always be applied, however, as necessary instruments may not be available. 

Other methods to control for endogeneity, reviewed in Section 2.8, include con- 
trol for confounding variables, differences in differences if repeated cross-section or 
panel data are available (see Chapter 21), fixed effects if panel data are available and 
endogeneity arises owing to a time-invariant omitted variable (see Section 21.6), and 
regression-discontinuity design (see Section 25.6). 


4.7.4. Omitted Variables 


Omission of a variable in a linear regression equation is often the first example of 
inconsistency of OLS presented in introductory courses. Such omission may be the 
consequence of an erroneous exclusion of a variable for which data are available or of 
exclusion of a variable that is not directly observed. For example, omission of ability in 
a regression of earnings (or more usually its natural logarithm) on schooling is usually 
due to unavailability of a comprehensive measure of ability. 

Let the true dgp be 


$$y = \mathbf{x}'\boldsymbol{\beta} + z\alpha + v, \tag{4.38}$$

where x and z are regressors, with z a scalar regressor for simplicity, and v is an error
term that is assumed to be uncorrelated with the regressors x and z. OLS estimation of
y on x and z will yield consistent parameter estimates of $\boldsymbol{\beta}$ and $\alpha$.

Suppose instead that y is regressed on x alone, with z omitted owing to unavailability.
Then the term $z\alpha$ is moved into the error term. The estimated model is

$$y = \mathbf{x}'\boldsymbol{\beta} + (z\alpha + v), \tag{4.39}$$

where the error term is now $(z\alpha + v)$. As before v is uncorrelated with x, but if z is
correlated with x the error term $(z\alpha + v)$ will be correlated with the regressors x. The
OLS estimator will be inconsistent for $\boldsymbol{\beta}$ if z is correlated with x.

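There is enough structure in this example to determine the direction of the inconsistency.
Stacking all observations in an obvious manner gives the dgp
$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{z}\alpha + \mathbf{v}$.
Substituting this into $\hat{\boldsymbol{\beta}}_{OLS} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ yields

$$\hat{\boldsymbol{\beta}}_{OLS} = \boldsymbol{\beta} + (N^{-1}\mathbf{X}'\mathbf{X})^{-1}(N^{-1}\mathbf{X}'\mathbf{z})\alpha + (N^{-1}\mathbf{X}'\mathbf{X})^{-1}(N^{-1}\mathbf{X}'\mathbf{v}).$$

Under the usual assumption that X is uncorrelated with v, the final term has probability
limit zero. X is correlated with z, however, and

$$\operatorname{plim} \hat{\boldsymbol{\beta}}_{OLS} = \boldsymbol{\beta} + \boldsymbol{\delta}\alpha, \tag{4.40}$$

where

$$\boldsymbol{\delta} = \operatorname{plim}\left[(N^{-1}\mathbf{X}'\mathbf{X})^{-1}(N^{-1}\mathbf{X}'\mathbf{z})\right]$$

is the probability limit of the OLS estimator in regression of the omitted regressor (z)
on the included regressors (X).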

This inconsistency is called omitted variables bias, where common terminology
states that various misspecifications lead to bias even though formally they lead to
inconsistency. The inconsistency exists as long as $\boldsymbol{\delta} \neq \mathbf{0}$, that is, as long as the omitted
variable is correlated with the included regressors. In general the inconsistency could
be positive or negative and could even lead to a sign reversal of the OLS coefficient.

For the returns to schooling example, the correlation between schooling and ability
is expected to be positive, so $\delta > 0$, and the return to ability is expected to be positive,
so $\alpha > 0$. It follows that $\delta\alpha > 0$, so the omitted variables bias is positive in this
example. OLS of earnings on schooling alone will overstate the effect of education on
earnings.
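The result in (4.40) can be checked numerically. The following is a minimal simulation sketch, not part of the original text: the parameter values (a return to schooling of 0.10, a return to ability of 0.05, and the way schooling depends on ability) are hypothetical choices made only for illustration. It compares the short regression (ability omitted) with the long regression and with the auxiliary regression of the omitted variable on the included regressors.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
beta, alpha = 0.10, 0.05        # hypothetical returns to schooling and to ability

ability = rng.normal(size=N)                        # omitted regressor z
school = 12 + 2.0 * ability + rng.normal(size=N)    # included regressor, correlated with z
v = rng.normal(scale=0.5, size=N)                   # error uncorrelated with both
y = beta * school + alpha * ability + v             # dgp as in (4.38)

def ols(X, y):
    """OLS coefficients (X'X)^{-1} X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

X = np.column_stack([np.ones(N), school])           # included regressors (with constant)
b_short = ols(X, y)                                 # y on x alone, as in (4.39)
b_long = ols(np.column_stack([X, ability]), y)      # y on x and z
delta = ols(X, ability)                             # omitted z on included x

# In-sample analogue of (4.40): short = long coefficients on x + delta * coefficient on z
print(b_short, b_long[:2] + delta * b_long[2])
# Population slope of the short regression: beta + delta * alpha = 0.10 + 0.4 * 0.05
print(b_short[1], beta + (2.0 / 5.0) * alpha)
```

In this design the schooling coefficient from the short regression settles near 0.12 rather than 0.10, an upward inconsistency as predicted when both δ and α are positive.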

A related form of misspecification is inclusion of irrelevant regressors. For example,
the regression may be of y on x and z, even though the dgp is more simply
$y = \mathbf{x}'\boldsymbol{\beta} + v$. In this case it is straightforward to show that OLS is consistent, but there
is a loss of efficiency.

Controlling for omitted variables bias is necessary if parameter estimates are to be 
given a causal interpretation. Since too many regressors cause little harm, but too few 
regressors can lead to inconsistency, microeconometric models estimated from large 
data sets tend to include many regressors. If omitted variables are still present then one 
of the methods given at the end of Section 4.7.3 is needed. 
4.7.5. Pseudo-True Value


In the omitted variables example the least-squares estimator is subject to confounding
in the sense that it does not estimate $\boldsymbol{\beta}$, but instead estimates a function of $\boldsymbol{\beta}$, $\boldsymbol{\delta}$, and $\alpha$.

The OLS estimate cannot be used as an estimate of $\boldsymbol{\beta}$, which, for example, measures
the effect of an exogenous change in a regressor x such as schooling holding all other
regressors including ability constant.

From (4.40), however, $\hat{\boldsymbol{\beta}}_{OLS}$ is a consistent estimator of the function $(\boldsymbol{\beta} + \boldsymbol{\delta}\alpha)$ and
has a meaningful interpretation. The probability limit of $\hat{\boldsymbol{\beta}}_{OLS}$, $\boldsymbol{\beta}^{*} = (\boldsymbol{\beta} + \boldsymbol{\delta}\alpha)$, is
referred to as the pseudo-true value (see Section 5.7.1 for a formal definition)
corresponding to $\hat{\boldsymbol{\beta}}_{OLS}$.

Furthermore, one can obtain the distribution of $\hat{\boldsymbol{\beta}}_{OLS}$ even though it is inconsistent
for $\boldsymbol{\beta}$. The estimated asymptotic variance of $\hat{\boldsymbol{\beta}}_{OLS}$ measures dispersion around
$(\boldsymbol{\beta} + \boldsymbol{\delta}\alpha)$ and is given by the usual estimator, for example by $s^2(\mathbf{X}'\mathbf{X})^{-1}$ if the error in
(4.38) is homoskedastic.


4.7.6. Parameter Heterogeneity 


The presentation to date has permitted regressors and error terms to vary across
individuals but has restricted the regression parameters $\boldsymbol{\beta}$ to be the same across individuals.
Instead, suppose that the dgp is

$$y_i = \mathbf{x}_i'\boldsymbol{\beta}_i + u_i, \tag{4.41}$$

with subscript i on the parameters. This is an example of parameter heterogeneity,
where the marginal effect $\partial \mathrm{E}[y_i|\mathbf{x}_i]/\partial \mathbf{x}_i = \boldsymbol{\beta}_i$ is now permitted to differ across individuals.

The random coefficients model or random parameters model specifies $\boldsymbol{\beta}_i$ to be
independently and identically distributed over i with distribution that does not depend
on the observables $\mathbf{x}_i$. Let the common mean of $\boldsymbol{\beta}_i$ be denoted $\boldsymbol{\beta}$. The dgp can be
rewritten as

$$y_i = \mathbf{x}_i'\boldsymbol{\beta} + (u_i + \mathbf{x}_i'(\boldsymbol{\beta}_i - \boldsymbol{\beta})),$$

and enough assumptions have been made to ensure that the regressors $\mathbf{x}_i$ are uncorrelated
with the error term $(u_i + \mathbf{x}_i'(\boldsymbol{\beta}_i - \boldsymbol{\beta}))$. OLS regression of y on x will therefore
consistently estimate $\boldsymbol{\beta}$, though note that the error is heteroskedastic even if $u_i$ is
homoskedastic.
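A small simulation can illustrate both claims. This is a sketch under assumed values, not taken from the text: a scalar regressor, no intercept, and individual slopes drawn independently of x with mean 2 and standard deviation 0.5. OLS should recover the mean slope, while the squared residuals should rise with x squared.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
x = rng.normal(loc=1.0, size=N)
beta_i = 2.0 + rng.normal(scale=0.5, size=N)   # random slopes, independent of x (assumed)
u = rng.normal(size=N)                         # homoskedastic error
y = x * beta_i + u                             # scalar version of (4.41)

b_ols = (x @ y) / (x @ x)                      # OLS without intercept

# Composite error u_i + x_i (beta_i - beta) has conditional variance 1 + 0.25 x_i^2,
# so regressing squared residuals on [1, x^2] should give a slope near 0.25.
e2 = (y - b_ols * x) ** 2
W = np.column_stack([np.ones(N), x ** 2])
gamma = np.linalg.lstsq(W, e2, rcond=None)[0]

print(b_ols, gamma)
```

Because the error variance depends on x, heteroskedasticity-robust standard errors would be the natural choice for inference in this setting.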

For panel data a standard model is the random effects model (see Section 21.7) that 
lets the intercept vary across individuals while the slope coefficients are not random. 

For nonlinear models a similar result need not hold, and random parameter models 
can be preferred as they permit a richer parameterization. Random parameter models 
are consistent with existence of heterogeneous responses of individuals to changes in 
x. A leading example is random parameters logit in Section 15.7. 

More serious complications can arise when the regression parameters $\boldsymbol{\beta}_i$ for an
individual are related to observed individual characteristics. Then OLS estimation can
lead to inconsistent parameter estimation. An example is the fixed effects model for
panel data (see Section 21.6) for which OLS estimation of y on x is inconsistent. In
this example, but not in all such examples, alternative consistent estimators for a subset
of the regression parameters are available.


4.8. Instrumental Variables 


A major complication that is emphasized in microeconometrics is the possibility of 
inconsistent parameter estimation caused by endogenous regressors. Then regression 
estimates measure only the magnitude of association, rather than the magnitude and 
direction of causation, both of which are needed for policy analysis. 

The instrumental variables estimator provides a way to nonetheless obtain consis- 
tent parameter estimates. This method, widely used in econometrics and rarely used 
elsewhere, is conceptually difficult and easily misused. 

We provide a lengthy expository treatment that defines an instrumental variable and 
explains how the instrumental variables method works in a simple setting. 


4.8.1. Inconsistency of OLS 


Consider the scalar regression model with dependent variable y and single regressor x.
The goal of regression analysis is to estimate the conditional mean function E[y|x]. A
linear conditional mean model, without intercept for notational convenience, specifies

$$\mathrm{E}[y|x] = \beta x. \tag{4.42}$$

This model without intercept subsumes the model with intercept if dependent and
regressor variables are deviations from their respective means. Interest lies in obtaining
a consistent estimate of $\beta$ as this gives the change in the conditional mean given an
exogenous change in x. For example, interest may lie in the effect on earnings caused
by an increase in schooling attributed to exogenous reasons, such as an increase in the
minimum age at which students leave school, that are not a choice of the individual.

The OLS regression model specifies

$$y = \beta x + u, \tag{4.43}$$

where u is an error term. Regression of y on x yields OLS estimate $\hat{\beta}$ of $\beta$.

Standard regression results make the assumption that the regressors are uncorrelated
with the errors in the model (4.43). Then the only effect of x on y is a direct effect via
the term $\beta x$. We have the following path analysis diagram:

    x ──→ y
        ↗
    u

where there is no association between x and u. So x and u are independent causes
of y.

However, in some situations there may be an association between regressors and
errors. For example, consider regression of log-earnings (y) on years of schooling (x).
The error term u embodies all factors other than schooling that determine earnings,
such as ability. Suppose a person has a high level of u, as a result of high (unobserved)
ability. This increases earnings, since $y = \beta x + u$, but it may also lead to higher levels
of x, since schooling is likely to be higher for those with high ability. A more
appropriate path diagram is then the following:

    x ──→ y
    ↑   ↗
    u

where now there is an association between x and u.

What are the consequences of this correlation between x and u? Now higher levels
of x have two effects on y. From (4.43) there is both a direct effect via $\beta x$ and an
indirect effect via u affecting x, which in turn affects y. The goal of regression is
to estimate only the first effect, yielding an estimate of $\beta$. The OLS estimate will
instead combine these two effects, giving $\hat{\beta} > \beta$ in this example where both effects
are positive. Using calculus, we have $y = \beta x + u(x)$ with total derivative

$$\frac{dy}{dx} = \beta + \frac{du}{dx}. \tag{4.44}$$

The data give information on dy/dx, so OLS estimates the total effect $\beta + du/dx$
rather than $\beta$ alone. The OLS estimator is therefore biased and inconsistent for $\beta$,
unless there is no association between x and u.

A more formal treatment of the linear regression model with K regressors leads to
the same conclusion. From Section 4.7.1 a necessary condition for consistency of OLS
is that $\operatorname{plim} N^{-1}\mathbf{X}'\mathbf{u} = \mathbf{0}$. Consistency requires that the regressors are asymptotically
uncorrelated with the errors. From (4.37) the magnitude of the inconsistency of OLS
is $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}$, the OLS coefficient from regression of u on x. This is just the OLS
estimate of du/dx, confirming the intuitive result in (4.44).
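The decomposition in (4.44) can be seen directly in simulated data, where the error u is known. The sketch below is illustrative only: the true slope of 0.5 and the way ability enters both schooling and the error are assumptions, and the regression of u on x is infeasible with real data because u is unobserved.

```python
import numpy as np

rng = np.random.default_rng(8)
N = 100_000
beta = 0.5                                   # assumed true causal effect

ability = rng.normal(size=N)                 # unobserved
u = ability + rng.normal(size=N)             # error term includes ability
x = 0.5 * ability + rng.normal(size=N)       # schooling is higher for the more able
y = beta * x + u                             # model (4.43), variables have mean zero

b_ols = (x @ y) / (x @ x)                    # OLS of y on x
du_dx = (x @ u) / (x @ x)                    # OLS of u on x (infeasible in practice)

# b_ols equals beta + du_dx exactly; in the population du/dx = 0.5 / 1.25 = 0.4
print(b_ols, beta + du_dx)
```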


4.8.2. Instrumental Variable 


The inconsistency of OLS is due to endogeneity of x, meaning that changes in x are 
associated not only with changes in y but also changes in the error u. What is needed 
is a method to generate only exogenous variation in x. An obvious way is through a 
randomized experiment, but for most economics applications such experiments are too 
expensive or even infeasible. 


Definition of an Instrument 


A crude experimental or treatment approach is still possible using observational data,
provided there exists an instrument z that has the property that changes in z are
associated with changes in x but do not lead to change in y (aside from the indirect route
via x). This leads to the following path diagram:

    z ──→ x ──→ y
          ↑   ↗
          u

which introduces a variable z that is causally associated with x but not u. It is still
the case that z and y will be correlated, but the only source of such correlation is the
indirect path of z being correlated with x, which in turn determines y. The more direct
path of z being a regressor in the model for y is ruled out.

More formally, a variable z is called an instrument or instrumental variable for
the regressor x in the scalar regression model $y = \beta x + u$ if (1) z is uncorrelated with
the error u and (2) z is correlated with the regressor x.

The first assumption excludes the instrument z from being a regressor in the model 
for y, since if instead y depended on both x and z, and y is regressed on x alone, then 
z is being absorbed into the error so that z will then be correlated with the error. The 
second assumption requires that there is some association between the instrument and 
the variable being instrumented. 


Examples of an Instrument 


In many microeconometric applications it is difficult to find legitimate instruments. 
Here we provide two examples. 

Suppose we want to estimate the response of market demand to exogenous changes 
in market price. Quantity demanded clearly depends on price, but prices are not ex- 
ogenously given since they are determined in part by market demand. A suitable in- 
strument for price is a variable that is correlated with price but does not directly affect 
quantity demanded. An obvious candidate is a variable that affects market supply, since
this also affects prices, but is not a direct determinant of demand. An example is a measure
of favorable growing conditions if an agricultural product is being modeled. The
choice of instrument here is uncontroversial, provided favorable growing conditions 
do not directly affect demand, and is helped greatly by the formal economic model of 
supply and demand. 

Next suppose we want to estimate the returns to exogenous changes in schooling. 
Most observational data sets lack measures of individual ability, so regression of earn- 
ings on schooling has error that includes unobserved ability and hence is correlated 
with the regressor schooling. We need an instrument z that is correlated with school- 
ing, uncorrelated with ability, and more generally uncorrelated with the error term, 
which means that it cannot directly determine earnings. 

One popular candidate for z is proximity to a college or university (Card, 1995). 
This clearly satisfies condition 2 because, for example, people whose home is a long 
way from a community college or state university are less likely to attend college. It 
most likely satisfies 1, though since it can be argued that people who live a long way 
from a college are more likely to be in low-wage labor markets one needs to estimate 
a multiple regression for y that includes additional regressors such as indicators for 
nonmetropolitan area. 

A second candidate for the instrument is month of birth (Angrist and Krueger, 
1991). This clearly satisfies condition 1 as there is no reason to believe that month 
of birth has a direct effect on earnings if the regression includes age in years. Surprisingly,
condition 2 may also be satisfied, as birth month determines age of first entry
into school in the USA, which in turn may affect years of schooling since laws often
specify a minimum school-leaving age. Bound, Jaeger, and Baker (1995) provide a
critique of this instrument.

The consequences of choosing poor instruments are considered in detail in Sec- 
tion 4.9. 


4.8.3. Instrumental Variables Estimator 


For regression with scalar regressor x and scalar instrument z, the instrumental
variables (IV) estimator is defined as

$$\hat{\beta}_{IV} = (\mathbf{z}'\mathbf{x})^{-1}\mathbf{z}'\mathbf{y}, \tag{4.45}$$

where, in the scalar regressor case, z, x, and y are $N \times 1$ vectors. This estimator provides
a consistent estimator for the slope coefficient $\beta$ in the linear model $y = \beta x + u$ if z
is correlated with x and uncorrelated with the error term.

There are several ways to derive (4.45). We provide an intuitive derivation, one that 
differs from derivations usually presented such as that in Section 6.2.5. 

Return to the earnings-schooling example. Suppose a one-unit change in the
instrument z is associated with 0.2 more years of schooling and with a $500 increase
in annual earnings. This increase in earnings is a consequence of the indirect effect
that increase in z led to increase in schooling, which in turn increases income. Then it
follows that 0.2 years additional schooling is associated with a $500 increase in earnings,
so that a one-year increase in schooling is associated with a $500/0.2 = $2,500
increase in earnings. The causal estimate of $\beta$ is therefore 2,500. In mathematical
notation we have estimated the changes dx/dz and dy/dz and calculated the causal
estimator as

$$\hat{\beta}_{IV} = \frac{dy/dz}{dx/dz}. \tag{4.46}$$

This approach to identification of the causal parameter $\beta$ is given in Heckman (2000,
p. 58); see also the example in Section 2.4.2.

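All that remains is consistent estimation of dy/dz and dx/dz. The obvious way to
estimate dy/dz is by OLS regression of y on z with slope estimate $(\mathbf{z}'\mathbf{z})^{-1}\mathbf{z}'\mathbf{y}$. Similarly,
estimate dx/dz by OLS regression of x on z with slope estimate $(\mathbf{z}'\mathbf{z})^{-1}\mathbf{z}'\mathbf{x}$.
Then

$$\hat{\beta}_{IV} = \frac{(\mathbf{z}'\mathbf{z})^{-1}\mathbf{z}'\mathbf{y}}{(\mathbf{z}'\mathbf{z})^{-1}\mathbf{z}'\mathbf{x}} = (\mathbf{z}'\mathbf{x})^{-1}\mathbf{z}'\mathbf{y}. \tag{4.47}$$
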
4.8.4. Wald Estimator 


A leading simple example of IV is one where the instrument z is a binary instrument.
Denote the subsample averages of y and x by $\bar{y}_1$ and $\bar{x}_1$, respectively, when
z = 1 and by $\bar{y}_0$ and $\bar{x}_0$, respectively, when z = 0. Then $\Delta y/\Delta z = (\bar{y}_1 - \bar{y}_0)$ and
$\Delta x/\Delta z = (\bar{x}_1 - \bar{x}_0)$, and (4.46) yields

$$\hat{\beta}_{Wald} = \frac{\bar{y}_1 - \bar{y}_0}{\bar{x}_1 - \bar{x}_0}. \tag{4.48}$$

This estimator is called the Wald estimator, after Wald (1940), or the grouping
estimator.

The Wald estimator can also be obtained from the formula (4.45). For the no-intercept
model variables are measured in deviations from means, so
$\mathbf{z}'\mathbf{y} = \sum_i (z_i - \bar{z})(y_i - \bar{y})$. For binary z this yields
$\mathbf{z}'\mathbf{y} = N_1(\bar{y}_1 - \bar{y}) = N_1 N_0(\bar{y}_1 - \bar{y}_0)/N$, where $N_0$
and $N_1$ are the number of observations for which z = 0 and z = 1. This result uses
$\bar{y}_1 - \bar{y} = (N_0\bar{y}_1 + N_1\bar{y}_1)/N - (N_0\bar{y}_0 + N_1\bar{y}_1)/N = N_0(\bar{y}_1 - \bar{y}_0)/N$. Similarly,
$\mathbf{z}'\mathbf{x} = N_1 N_0(\bar{x}_1 - \bar{x}_0)/N$. Combining these results, we have that (4.45) yields (4.48).

For the earnings—schooling example it is being assumed that we can define two 
groups where group membership does not directly determine earnings, though it does 
affect level of schooling and hence indirectly affects earnings. Then the IV estimate is 
the difference in average earnings across the two groups divided by the difference in 
average schooling across the two groups. 
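The equivalence of the Wald estimator (4.48) and the IV formula (4.45) with a binary instrument can be verified numerically. The sketch below uses made-up values (a true slope of 1.0, an instrument that raises x by 0.5, and an unobserved ability term entering both x and y); none of these come from the text.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 50_000
z = rng.integers(0, 2, size=N)                        # binary instrument (group indicator)
ability = rng.normal(size=N)                          # unobserved, part of the error
x = 12 + 0.5 * z + ability + rng.normal(size=N)       # schooling
y = 1.0 * x + 2.0 * ability + rng.normal(size=N)      # earnings, assumed true slope 1.0

# Wald estimator (4.48): ratio of differences in group means
wald = (y[z == 1].mean() - y[z == 0].mean()) / (x[z == 1].mean() - x[z == 0].mean())

# IV estimator (4.45) with variables in deviations from means
zd, xd, yd = z - z.mean(), x - x.mean(), y - y.mean()
iv = (zd @ yd) / (zd @ xd)

# OLS for comparison: inconsistent because x is correlated with ability
ols = (xd @ yd) / (xd @ xd)

print(wald, iv, ols)
```

The Wald and IV values agree to machine precision, while OLS lies well above the assumed slope in this design.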


4.8.5. Sample Covariance and Correlation Analysis 


The IV estimator can also be interpreted in terms of covariances or correlations.
For sample covariances we have directly from (4.45) that

$$\hat{\beta}_{IV} = \frac{\mathrm{Cov}[z, y]}{\mathrm{Cov}[z, x]}, \tag{4.49}$$

where here Cov[ ] is being used to denote sample covariance.

For sample correlations, note that the OLS estimator for the model (4.43) can be
written as $\hat{\beta}_{OLS} = r_{xy}\sqrt{\mathbf{y}'\mathbf{y}}/\sqrt{\mathbf{x}'\mathbf{x}}$, where $r_{xy} = \mathbf{x}'\mathbf{y}/\sqrt{(\mathbf{x}'\mathbf{x})(\mathbf{y}'\mathbf{y})}$ is the sample
correlation between x and y. This leads to the interpretation of the OLS estimator as implying
that a one standard deviation change in x is associated with an $r_{xy}$ standard deviation
change in y. The problem is that the correlation $r_{xy}$ is contaminated by correlation
between x and u. An alternative approach is to measure the correlation between x and
y indirectly by the correlation between z and y divided by the correlation between z
and x. Then

$$\hat{\beta}_{IV} = \frac{r_{zy}\sqrt{\mathbf{y}'\mathbf{y}}}{r_{zx}\sqrt{\mathbf{x}'\mathbf{x}}}, \tag{4.50}$$

which can be shown to equal $\hat{\beta}_{IV}$ in (4.45).
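The four expressions (4.45), (4.47), (4.49), and (4.50) are algebraically identical once variables are measured in deviations from means. A short numerical check, on simulated data whose design is an arbitrary assumption, is sketched here.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 1_000
z = rng.normal(size=N)
v = rng.normal(size=N)
u = 0.8 * v + 0.6 * rng.normal(size=N)        # error correlated with v, hence with x
x = z + v
y = 0.5 * x + u

# Work in deviations from means, as in the no-intercept model
z, x, y = (a - a.mean() for a in (z, x, y))

b_iv = (z @ y) / (z @ x)                                     # (4.45)
b_ratio = ((z @ y) / (z @ z)) / ((z @ x) / (z @ z))          # (4.47): ratio of two OLS slopes
b_cov = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]              # (4.49)
r_zy = np.corrcoef(z, y)[0, 1]
r_zx = np.corrcoef(z, x)[0, 1]
b_corr = (r_zy / r_zx) * np.sqrt(y @ y) / np.sqrt(x @ x)     # (4.50)

print(b_iv, b_ratio, b_cov, b_corr)
```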


4.8.6. IV Estimation for Multiple Regression


Now consider the multiple regression model with typical observation

$$y = \mathbf{x}'\boldsymbol{\beta} + u,$$

with K regressor variables, so that x and $\boldsymbol{\beta}$ are $K \times 1$ vectors.


Instruments


Assume the existence of an $r \times 1$ vector of instruments z, with $r \geq K$, satisfying the
following:


1. z is uncorrelated with the error u.
2. z is correlated with the regressor vector x.
3. z is strongly correlated, rather than weakly correlated, with the regressor vector x.


The first two properties are necessary for consistency and were presented earlier in
the scalar case. The third property, defined in Section 4.9.1, is a strengthening of the
second to ensure good finite-sample performance of the IV estimator.

In the multiple regression case z and x may share some common components.
Some components of x, called exogenous regressors, may be uncorrelated with u.
These components are clearly suitable instruments as they satisfy conditions 1 and
2. Other components of x, called endogenous regressors, may be correlated with u.
These components lead to inconsistency of OLS and are also clearly unsuitable
instruments as they do not satisfy condition 1. Partition x into $\mathbf{x} = [\mathbf{x}_1' \; \mathbf{x}_2']'$, where $\mathbf{x}_1$
contains endogenous regressors and $\mathbf{x}_2$ contains exogenous regressors. Then a valid
instrument is $\mathbf{z} = [\mathbf{z}_1' \; \mathbf{x}_2']'$, where $\mathbf{x}_2$ can be an instrument for itself, but we need to find
at least as many instruments $\mathbf{z}_1$ as there are endogenous variables $\mathbf{x}_1$.


Identification 


Identification in a simultaneous equations model was presented in Section 2.5. Here we
have a single equation. The order condition requires that the number of instruments
must at least equal the number of independent endogenous components, so that $r \geq K$.
The model is said to be just-identified if $r = K$ and overidentified if $r > K$.

In many multiple regression applications there is only one endogenous regressor. 
For example, the earnings on schooling regression will include many other regressors 
such as age, geographic location, and family background. Interest lies in the coefficient 
on schooling, but this is an endogenous variable most likely correlated with the error 
because ability is unobserved. Possible candidates for the necessary single instrument 
for schooling have already been given in Section 4.8.2. 

If an instrument fails the first condition the instrument is an invalid instrument. If 
an instrument fails the second condition the instrument is an irrelevant instrument, 
and the model may be unidentified if too few instruments are relevant. The third con- 
dition fails when very low correlation exists between the instrument and the endoge- 
nous variable being instrumented. The model is said to be weakly identified and the 
instrument is called a weak instrument. 


Instrumental Variables Estimator 


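When the model is just-identified, so that r = K, the instrumental variables estimator
is the obvious matrix generalization of (4.45)

$$\hat{\boldsymbol{\beta}}_{IV} = (\mathbf{Z}'\mathbf{X})^{-1}\mathbf{Z}'\mathbf{y}, \tag{4.51}$$

where Z is an $N \times K$ matrix with ith row $\mathbf{z}_i'$. Substituting the regression model
$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$ for y in (4.51) yields

$$\begin{aligned}
\hat{\boldsymbol{\beta}}_{IV} &= (\mathbf{Z}'\mathbf{X})^{-1}\mathbf{Z}'[\mathbf{X}\boldsymbol{\beta} + \mathbf{u}] \\
&= \boldsymbol{\beta} + (\mathbf{Z}'\mathbf{X})^{-1}\mathbf{Z}'\mathbf{u} \\
&= \boldsymbol{\beta} + (N^{-1}\mathbf{Z}'\mathbf{X})^{-1}N^{-1}\mathbf{Z}'\mathbf{u}.
\end{aligned}$$

It follows immediately that the IV estimator is consistent if

$$\operatorname{plim} N^{-1}\mathbf{Z}'\mathbf{u} = \mathbf{0}$$

and

$$\operatorname{plim} N^{-1}\mathbf{Z}'\mathbf{X} \neq \mathbf{0}.$$

These are essentially conditions 1 and 2 that z is uncorrelated with u and correlated
with x. To ensure that the inverse of $N^{-1}\mathbf{Z}'\mathbf{X}$ exists it is assumed that $\mathbf{Z}'\mathbf{X}$ is of full
rank K, a stronger assumption than the order condition that r = K.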

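With heteroskedastic errors the IV estimator is asymptotically normal with mean $\boldsymbol{\beta}$
and variance matrix consistently estimated by

$$\hat{\mathrm{V}}[\hat{\boldsymbol{\beta}}_{IV}] = (\mathbf{Z}'\mathbf{X})^{-1}\mathbf{Z}'\hat{\boldsymbol{\Omega}}\mathbf{Z}(\mathbf{X}'\mathbf{Z})^{-1}, \tag{4.52}$$

where $\hat{\boldsymbol{\Omega}} = \mathrm{Diag}[\hat{u}_i^2]$. This result is obtained in a manner similar to that for OLS given
in Section 4.4.4.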

The IV estimator, although consistent, leads to a loss of efficiency that can be very 
large in practice. Intuitively IV will not work well if the instrument z has low correla- 
tion with the regressor x (see Section 4.9.3). 
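The just-identified estimator (4.51) and the robust variance estimate (4.52) take only a few lines of matrix code. The following sketch uses a hypothetical design with one endogenous regressor x1, one exogenous regressor x2 that serves as its own instrument, and one excluded instrument z1; the coefficient values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 20_000
x2 = rng.normal(size=N)                       # exogenous regressor
z1 = rng.normal(size=N)                       # excluded instrument
v = rng.normal(size=N)
u = 0.8 * v + 0.6 * rng.normal(size=N)        # error correlated with v
x1 = 0.5 * x2 + z1 + v                        # endogenous regressor
y = 1.0 + 0.5 * x1 - 1.0 * x2 + u             # assumed true coefficients (1.0, 0.5, -1.0)

X = np.column_stack([np.ones(N), x1, x2])     # regressors, K = 3
Z = np.column_stack([np.ones(N), z1, x2])     # instruments, r = 3 (just-identified)

ZX = Z.T @ X
b_iv = np.linalg.solve(ZX, Z.T @ y)           # (4.51): (Z'X)^{-1} Z'y

uhat = y - X @ b_iv                           # IV residuals
meat = (Z * uhat[:, None] ** 2).T @ Z         # Z' Diag[uhat_i^2] Z
ZX_inv = np.linalg.inv(ZX)
V_iv = ZX_inv @ meat @ ZX_inv.T               # (4.52); note (X'Z)^{-1} = [(Z'X)^{-1}]'
se_iv = np.sqrt(np.diag(V_iv))

b_ols = np.linalg.solve(X.T @ X, X.T @ y)     # inconsistent comparison

print(b_iv, se_iv, b_ols)
```

In this design the OLS coefficient on x1 is pulled toward 0.9, whereas the IV coefficient stays close to the assumed value of 0.5 at the cost of a larger standard error.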


4.8.7. Two-Stage Least Squares 


The IV estimator in (4.51) requires that the number of instruments equals the number 
of regressors. For overidentified models the IV estimator can be used, by discarding 
some of the instruments so that the model is just-identified. However, an asymptotic 
efficiency loss can occur when discarding these instruments. 

Instead, a common procedure is to use the two-stage least-squares (2SLS) estimator

$$\hat{\boldsymbol{\beta}}_{2SLS} = \left[\mathbf{X}'\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{X}\right]^{-1}\left[\mathbf{X}'\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y}\right], \tag{4.53}$$

presented and motivated in Section 6.4.

The 2SLS estimator is an IV estimator. In a just-identified model it simplifies to
the IV estimator given in (4.51) with instruments Z. In an overidentified model the
2SLS estimator equals the IV estimator given in (4.51) if the instruments are $\hat{\mathbf{X}}$, where
$\hat{\mathbf{X}} = \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{X}$ is the predicted value of x from OLS regression of x on z.

The 2SLS estimator gets its name from the result that it can be obtained by two
consecutive OLS regressions: OLS regression of x on z to get $\hat{\mathbf{x}}$ followed by OLS
of y on $\hat{\mathbf{x}}$, which gives $\hat{\boldsymbol{\beta}}_{2SLS}$. This interpretation does not necessarily generalize to
nonlinear regressions; see Section 6.5.6.
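The 2SLS estimator is often expressed more compactly as

$$\hat{\boldsymbol{\beta}}_{2SLS} = \left[\mathbf{X}'\mathbf{P}_Z\mathbf{X}\right]^{-1}\left[\mathbf{X}'\mathbf{P}_Z\mathbf{y}\right], \tag{4.54}$$

where

$$\mathbf{P}_Z = \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'$$

is an idempotent projection matrix that satisfies $\mathbf{P}_Z = \mathbf{P}_Z'$, $\mathbf{P}_Z\mathbf{P}_Z' = \mathbf{P}_Z$, and $\mathbf{P}_Z\mathbf{Z} = \mathbf{Z}$.
The 2SLS estimator can be shown to be asymptotically normally distributed with
estimated asymptotic variance

$$\hat{\mathrm{V}}[\hat{\boldsymbol{\beta}}_{2SLS}] = N\left[\mathbf{X}'\mathbf{P}_Z\mathbf{X}\right]^{-1}\left[\mathbf{X}'\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\hat{\mathbf{S}}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{X}\right]\left[\mathbf{X}'\mathbf{P}_Z\mathbf{X}\right]^{-1}, \tag{4.55}$$

where, in the usual case of heteroskedastic errors, $\hat{\mathbf{S}} = N^{-1}\sum_i \hat{u}_i^2\mathbf{z}_i\mathbf{z}_i'$ and
$\hat{u}_i = y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}}_{2SLS}$. A commonly used small-sample adjustment is to divide by $N - K$ rather
than N in the formula for $\hat{\mathbf{S}}$.

In the special case that errors are homoskedastic, simplification occurs and
$\hat{\mathrm{V}}[\hat{\boldsymbol{\beta}}_{2SLS}] = s^2[\mathbf{X}'\mathbf{P}_Z\mathbf{X}]^{-1}$. This latter result is given in many introductory treatments,
but the more general formula (4.55) is preferred as the modern approach is to treat
errors as potentially heteroskedastic.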

For overidentified models with heteroskedastic errors an estimator that White 
(1982) calls the two-stage instrumental variables estimator is more efficient than 
2SLS. Moreover, some commonly used model specification tests require estimation 
by this estimator rather than 2SLS. For details see Section 6.4.2. 
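The 2SLS formulas can be sketched in code for an overidentified model. The design below, with two instruments for a single endogenous regressor and a true slope of 0.5, is an assumption made for illustration. The code computes (4.54) using fitted values rather than forming the N × N matrix P_Z, confirms that two consecutive OLS regressions give the same coefficients, and evaluates the robust variance (4.55).

```python
import numpy as np

rng = np.random.default_rng(5)
N = 20_000
z1, z2 = rng.normal(size=N), rng.normal(size=N)
v = rng.normal(size=N)
u = 0.8 * v + 0.6 * rng.normal(size=N)
x = z1 + 0.5 * z2 + v                          # endogenous regressor
y = 0.5 * x + u                                # assumed true slope 0.5, intercept 0

X = np.column_stack([np.ones(N), x])           # K = 2
Z = np.column_stack([np.ones(N), z1, z2])      # r = 3 > K: overidentified

# First stage: Xhat = P_Z X, the fitted values from OLS of X on Z
Xhat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)

# (4.54): [X'P_Z X]^{-1} X'P_Z y, using Xhat'X = X'P_Z X and Xhat'y = X'P_Z y
b_2sls = np.linalg.solve(Xhat.T @ X, Xhat.T @ y)

# Second-stage OLS of y on Xhat gives the same coefficients
b_two_step = np.linalg.lstsq(Xhat, y, rcond=None)[0]

# Robust variance (4.55); residuals must use X, not Xhat
uhat = y - X @ b_2sls
S = (Z * uhat[:, None] ** 2).T @ Z / N         # S-hat = N^{-1} sum uhat_i^2 z_i z_i'
A = np.linalg.inv(Xhat.T @ X)                  # [X'P_Z X]^{-1}
B = X.T @ Z @ np.linalg.inv(Z.T @ Z)           # X'Z (Z'Z)^{-1}
V_2sls = N * A @ (B @ S @ B.T) @ A
se_2sls = np.sqrt(np.diag(V_2sls))

print(b_2sls, b_two_step, se_2sls)
```

The coefficients from the two routes coincide, but standard errors reported by a naive second-stage OLS would be based on residuals y − X̂β̂ rather than y − Xβ̂ and so should not be used.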


4.8.8. IV Example 


As an example of IV estimation, consider estimation of the slope coefficient of x for
the dgp

$$\begin{aligned}
y &= 0 + 0.5x + u, \\
x &= 0 + z + v,
\end{aligned}$$

where $z \sim \mathcal{N}[2, 1]$ and (u, v) are joint normal with means 0, variances 1, and
correlation 0.8.

OLS of y on x yields inconsistent estimates as x is correlated with u since by
construction x is correlated with v, which in turn is correlated with u. IV estimation
yields consistent estimates. The variable z is a valid instrument since by construction
it is uncorrelated with u but is correlated with x. Transformations of z, such as $z^3$, are
also valid instruments.

Various estimates and associated standard errors from a generated data sample of 
size 10,000 are given in Table 4.4. We focus on the slope coefficient. 

The OLS estimator is inconsistent, with slope coefficient estimate of 0.902 being 
more than 50 standard errors from the true value of 0.5. The remaining estimates are 
consistent and are all within two standard errors of 0.5. 

Table 4.4. Instrumental Variables Example^a

              OLS        IV         2SLS       IV (z^3)
Constant     -0.804     -0.017     -0.017     -0.014
             (0.014)    (0.022)    (0.032)    (0.025)
x             0.902      0.510      0.510      0.509
             (0.006)    (0.010)    (0.014)    (0.012)
R^2           0.709      0.576      0.576      0.574

a Generated data for a sample size of 10,000. OLS is inconsistent and other estimators
are consistent. Robust standard errors are reported though they are unnecessary
here as errors are homoskedastic. The 2SLS standard errors are incorrect.
The data-generating process is given in the text.


There are several ways to compute the IV estimator. The slope coefficient from
OLS regression of y on z is 0.5168 and from OLS regression of x on z it is 1.0124,
yielding an IV estimate of 0.5168/1.0124 = 0.510 using (4.47). In practice one instead
directly computes the IV estimator using (4.45) or (4.51), with z used as the instrument
for x and standard errors computed using (4.52). The 2SLS estimator (see (4.54))
can be computed by OLS regression of y on $\hat{x}$, where $\hat{x}$ is the prediction from OLS
regression of x on z. The 2SLS estimates exactly equal the IV estimates in this
just-identified model, though the standard errors from this OLS regression of y on $\hat{x}$ are
incorrect as will be explained in Section 6.4.5.

The final column uses $z^3$ rather than z as the instrument for x. This alternative IV
estimator is consistent, since $z^3$ is uncorrelated with u and correlated with x. However,
it is less efficient for this particular dgp, and the standard error of the slope coefficient
rises from 0.010 to 0.012.

There is an efficiency loss in IV estimation compared to OLS estimation; see (4.61)
for a general result for the case of single regressor and single instrument. Here
$r_{xz}^2 = 0.510$, not given in Table 4.4, is high so the loss is not great and the standard error of
the slope coefficient increases somewhat from 0.006 to 0.010. In practice the efficiency
loss can be much greater than this.
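The example can be re-created along the following lines. This is a sketch only: the dgp is the one stated in the text, but the random seed and generator are arbitrary, so the estimates will be close to, not identical to, those in Table 4.4. The helper function `iv` is a hypothetical name introduced here.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 10_000
z = rng.normal(loc=2.0, scale=1.0, size=N)                # z ~ N[2, 1]
cov_uv = np.array([[1.0, 0.8], [0.8, 1.0]])               # unit variances, correlation 0.8
u, v = rng.multivariate_normal([0.0, 0.0], cov_uv, size=N).T
x = z + v
y = 0.5 * x + u

def iv(y, x, z):
    """Just-identified IV (4.51) with a constant: returns (intercept, slope)."""
    X = np.column_stack([np.ones_like(x), x])
    Z = np.column_stack([np.ones_like(z), z])
    return np.linalg.solve(Z.T @ X, Z.T @ y)

b_ols = iv(y, x, x)                 # OLS is IV with x as its own instrument
b_iv = iv(y, x, z)                  # instrument z
b_iv_z3 = iv(y, x, z ** 3)          # instrument z^3

# 2SLS as two OLS steps: x on z, then y on the fitted values
xhat = np.column_stack([np.ones(N), z]) @ iv(x, z, z)
b_2sls = iv(y, xhat, xhat)

r2_xz = np.corrcoef(x, z)[0, 1] ** 2    # squared correlation of x and z

print(b_ols, b_iv, b_2sls, b_iv_z3, r2_xz)
```

For this dgp the population OLS slope is 0.5 + 0.8/2 = 0.9 and the population squared correlation between x and z is 0.5, which is consistent with the values reported in the text.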


4.9. Instrumental Variables in Practice 


Important practical issues include determining whether IV methods are necessary and, 
if necessary, determining whether the instruments are valid. The relevant specification 
tests are presented in Section 8.4. Unfortunately, the validity of these tests is limited. They
require the assumption that in a just-identified model the instruments are valid and test 
only overidentifying restrictions. 

Although IV estimators are consistent given valid instruments, as detailed in the 
following, IV estimators can be much less efficient than the OLS estimator and can 
have a finite-sample distribution that for usual finite-sample sizes differs greatly from 
the asymptotic distribution. These problems are greatly magnified if instruments are 
weakly correlated with the variables being instrumented. One way that weak instruments
can arise is if there are many more instruments than needed. This is simply
dealt with by dropping some of the instruments (see also Donald and Newey, 2001). A
more fundamental problem arises when even with the minimal number of instruments
one or more of the instruments is weak.

This section focuses on the problem of weak instruments.


4.9.1. Weak Instruments 


There is no single definition of a weak instrument. Many authors use the following 
signals of a weak instrument, presented here for progressively more complex models. 


• Scalar regressor x and scalar instrument z: A weak instrument is one for which $r_{xz}^2$ is
small.

• Scalar regressor x and vector of instruments z: The instruments are weak if the $R^2$ from
regression of x on z, denoted $R^2_{x,\mathbf{z}}$, is small or if the F-statistic for test of overall fit in
this regression is small.

• Multiple regressors x with only one endogenous: A weak instrument is one for which
the partial $R^2$ is low or the partial F-statistic is small, where these partial statistics are
defined toward the end of Section 4.9.1.

• Multiple regressors x with several endogenous: There are several measures.


R? Measures 


Consider a single equation

$$y = \beta_1 x_1 + \mathbf{x}_2'\boldsymbol{\beta}_2 + u, \tag{4.56}$$

where just one regressor $x_1$ is endogenous and the remaining regressors in the vector
$\mathbf{x}_2$ are exogenous. Assume that the instrument vector z includes the exogenous
instruments $\mathbf{x}_2$, as well as at least one other instrument.

One possible $R^2$ measure is the usual $R^2$ from regression of $x_1$ on z. However, this
could be high only because $x_1$ is highly correlated with $\mathbf{x}_2$ whereas intuitively we really
need $x_1$ to be highly correlated with the instrument(s) other than $\mathbf{x}_2$.

Bound, Jaeger, and Baker (1995) therefore proposed use of a partial $R^2$, denoted
$R_p^2$, that purges the effect of $\mathbf{x}_2$. $R_p^2$ is obtained as $R^2$ from the regression

$$(x_1 - \tilde{x}_1) = (\mathbf{z} - \tilde{\mathbf{z}})'\boldsymbol{\gamma} + v, \tag{4.57}$$

where $\tilde{x}_1$ and $\tilde{\mathbf{z}}$ are the fitted values from regressions of $x_1$ on $\mathbf{x}_2$ and z on $\mathbf{x}_2$. In the
just-identified case $\mathbf{z} - \tilde{\mathbf{z}}$ will reduce to $z_1 - \tilde{z}_1$, where $z_1$ is the single instrument other
than $\mathbf{x}_2$ and $\tilde{z}_1$ is the fitted value from regression of $z_1$ on $\mathbf{x}_2$.

It is not unusual for $R_p^2$ to be much lower than $R^2_{x_1,\mathbf{z}}$. The formula for $R_p^2$ simplifies
to $R^2_{x,\mathbf{z}}$ when there is only one regressor and it is endogenous. It further simplifies to
$\mathrm{Cor}[x, z]^2$ when there is only one instrument.
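The gap between the usual R² and the partial R² of (4.57) can be made concrete with a small sketch. The design is hypothetical: one exogenous regressor x2 that explains most of x1, and a single excluded instrument z1 with only a modest effect on x1. The helper names `resid` and `r2` are introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 5_000
x2 = rng.normal(size=N)                            # exogenous regressor
z1 = 0.5 * x2 + rng.normal(size=N)                 # excluded instrument, correlated with x2
x1 = 2.0 * x2 + 0.2 * z1 + rng.normal(size=N)      # endogenous regressor

def resid(y, X):
    """Residuals from OLS regression of y on the columns of X."""
    return y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

def r2(y, X):
    """R-squared from OLS regression of y on X (y measured around its mean)."""
    e = resid(y, X)
    yd = y - y.mean()
    return 1.0 - (e @ e) / (yd @ yd)

const = np.ones(N)
X2 = np.column_stack([const, x2])
Z = np.column_stack([const, x2, z1])               # full instrument set

r2_usual = r2(x1, Z)                               # usual R^2 of x1 on all instruments

x1_t = resid(x1, X2)                               # x1 purged of x2
z1_t = resid(z1, X2)                               # z1 purged of x2
r2_partial = r2(x1_t, z1_t[:, None])               # partial R^2 from (4.57)

print(r2_usual, r2_partial)
```

Here the usual R² is high because x2 explains much of x1, while the partial R² is only a few percent, which is the situation that signals a weak instrument.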

When there is more than one endogenous variable, analysis is less straightforward
as a number of generalizations of $R_p^2$ have been proposed.

Consider a single-equation model with more than one endogenous variable and
focus on estimation of the coefficient of the first endogenous variable. Then in (4.56)
$x_1$ is endogenous and additionally some of the variables in $\mathbf{x}_2$ are also endogenous.
Several alternative measures replace the right-hand side of (4.57) with a residual that
controls for the presence of other endogenous regressors. Shea (1997) proposed a partial
$R^2$, say $R_p^{*2}$, that is computed as the squared sample correlation between $(x_1 - \tilde{x}_1)$
and $(\hat{x}_1 - \tilde{\hat{x}}_1)$. Here $(x_1 - \tilde{x}_1)$ is again the residual from regression of $x_1$ on $\mathbf{x}_2$, whereas
$(\hat{x}_1 - \tilde{\hat{x}}_1)$ is the residual from regression of $\hat{x}_1$ (the fitted value from regression of $x_1$
on z) on $\hat{\mathbf{x}}_2$ (the fitted value from regression of $\mathbf{x}_2$ on z). Poskitt and Skeels (2002)
proposed an alternative partial $R^2$, which, like Shea's $R_p^{*2}$, simplifies to $R_p^2$ when there is
only one endogenous regressor. Hall, Rudebusch, and Wilcox (1996) instead proposed
use of canonical correlations.

These measures for the coefficient for the first endogenous variable can be repeated
for the other endogenous variables. Poskitt and Skeels (2002) additionally consider an
$R^2$ measure that applies jointly to instrumentation of all the endogenous variables.

The problems of inconsistency of estimators and loss of precision are magnified
as the partial $R^2$ measures fall, as detailed in Sections 4.9.2 and 4.9.3. See especially
(4.60) and (4.62).


Partial F-Statistics 


For poor finite-sample performance, considered in Section 4.9.4, it is common to use 
a related measure, the F-statistic for whether coefficients are zero in regression of the 
endogenous regressor on instruments. 

For a single regressor that is endogenous we use the usual overall $F$-statistic, for a
test of $\boldsymbol{\pi} = \mathbf{0}$ in the regression $x = \mathbf{z}'\boldsymbol{\pi} + v$ of the endogenous regressor on the
instruments. This $F$-statistic is a function of $R_{x,\mathbf{z}}^2$.

More commonly, some exogenous regressors also appear in the model, and in model
(4.56) with single endogenous regressor $x_1$ we use the $F$-statistic for a test of $\boldsymbol{\pi}_1 = \mathbf{0}$
in the regression

$$x_1 = \mathbf{z}_1'\boldsymbol{\pi}_1 + \mathbf{x}_2'\boldsymbol{\pi}_2 + v, \tag{4.58}$$

where $\mathbf{z}_1$ are the instruments other than the exogenous regressors and $\mathbf{x}_2$ are the
exogenous regressors. This is the first-stage regression in the two-stage least-squares
interpretation of IV.
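
As a sketch of how these $F$-statistics relate to the $R^2$ measures, the standard formulas for the default (non-robust) $F$-test can be written as follows. The symbols $N$ (sample size), $K$ (number of instruments in $\mathbf{z}$), $q$ (number of instruments in $\mathbf{z}_1$), and $k$ (total number of first-stage regressors, including the intercept) are notation introduced here for illustration, and an intercept is assumed to be included in each regression.

```latex
% Overall F-statistic for the regression of x on z (K instruments plus an intercept):
F = \frac{R_{x,\mathbf{z}}^2 / K}{(1 - R_{x,\mathbf{z}}^2)/(N - K - 1)}

% Partial F-statistic for \pi_1 = 0 in (4.58), with q excluded instruments
% and k first-stage regressors in total:
F_{\text{partial}} = \frac{R_p^2 / q}{(1 - R_p^2)/(N - k)}
```

For a fixed partial $R^2$ the partial $F$-statistic grows roughly in proportion to $N$, so a large sample can produce a large $F$-statistic even when the partial $R^2$ is very small.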

This statistic is used as a signal of potential finite-sample bias in the IV estimator. 
In Section 4.9.4 we explain results of Staiger and Stock (1997) that suggest a value 
less than 10 is problematic and a value of 5 or less is a sign of extreme finite-sample 
bias and we consider extension to more than one endogenous regressor. 


4.9.2. Inconsistency of IV Estimators 


The essential condition for consistency of IV is condition 1 in Section 4.8.6, that 
the instrument should be uncorrelated with the error term. No test is possible in the 
just-identified case. In the overidentified case a test of the overidentifying assump- 
tions is possible (see Section 6.4.3). Rejection then could be due to either instrument 
endogeneity or model failure. Thus condition 1 is difficult to test directly and
determining whether an instrument is exogenous is usually a subjective decision, albeit one
often guided by economic theory. 

It is always possible to create an exogenous instrument through functional form
restrictions. For example, suppose there are two regressors so that $y = \beta_1 x_1 + \beta_2 x_2 + u$,
with $x_1$ uncorrelated with $u$ and $x_2$ correlated with $u$. Note that throughout this
section all variables are assumed to be measured in departures from means, so that
without loss of generality the intercept term can be omitted. Then OLS is inconsistent,
as $x_2$ is endogenous. A seemingly good instrument for $x_2$ is $x_1^2$, since $x_1^2$ is likely to
be uncorrelated with $u$ because $x_1$ is uncorrelated with $u$. However, the validity of
this instrument requires the functional form restriction on the conditional mean that
$x_1$ only enters the model linearly and not quadratically. In practice one should view a
linear model as only an approximation, and obtaining instruments in such an artificial
way can be easily criticized.

A better way to create a valid instrument is through alternative exclusion restric- 
tions that do not rely so heavily on choice of functional form. Some practical examples 
have been given in Section 4.8.2. 

Structural models such as the classical linear simultaneous equations model (see 
Sections 2.4 and 6.10.6) make such exclusion restrictions very explicit. Even then the 
restrictions can often be criticized for being too ad hoc, unless compelling economic 
theory supports the restrictions. 

For panel data applications it may be reasonable to assume that only current data 
may belong in the equation of interest — an exclusion restriction permitting past data 
to be used as instruments under the assumption that errors are serially uncorrelated 
(see Section 22.2.4). Similarly, in models of decision making under uncertainty (see 
Section 6.2.7), lagged variables can be used as instruments as they are part of the 
information set. 

There is no formal test of instrument exogeneity that does not additionally test 
whether the regression equation is correctly specified. Instrument exogeneity in- 
evitably relies on a priori information, such as that from economic or statistical theory. 
The evaluation by Bound et al. (1995, pp. 446—447) of the validity of the instruments 
used by Angrist and Krueger (1991) provides an insightful example of the subtleties 
involved in determining instrument exogeneity. 

It is especially important that an instrument be exogenous if an instrument is weak, 
because with weak instruments even very mild endogeneity of the instrument can lead 
to IV parameter estimates that are much more inconsistent than the already inconsistent 
OLS parameter estimates. 

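For simplicity consider linear regression with one regressor and one instrument;
hence $y = \beta x + u$. Then performing some algebra, left as an exercise, yields

$$\frac{\text{plim}\,\hat{\beta}_{IV} - \beta}{\text{plim}\,\hat{\beta}_{OLS} - \beta} = \frac{\text{Cor}[z, u]}{\text{Cor}[x, u]} \times \frac{1}{\text{Cor}[z, x]}. \tag{4.59}$$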
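
One way to carry out the algebra behind (4.59) is sketched here, assuming variables are measured in departures from means so that the probability limits of the sample moments are population covariances and variances. The symbols $\sigma_x$, $\sigma_z$, and $\sigma_u$ (population standard deviations) are introduced only for this derivation.

```latex
\begin{align*}
\text{plim}\,\hat{\beta}_{IV} - \beta
  &= \frac{\text{Cov}[z,u]}{\text{Cov}[z,x]}
   = \frac{\text{Cor}[z,u]\,\sigma_z\sigma_u}{\text{Cor}[z,x]\,\sigma_z\sigma_x}
   = \frac{\text{Cor}[z,u]}{\text{Cor}[z,x]} \cdot \frac{\sigma_u}{\sigma_x}, \\
\text{plim}\,\hat{\beta}_{OLS} - \beta
  &= \frac{\text{Cov}[x,u]}{\text{V}[x]}
   = \frac{\text{Cor}[x,u]\,\sigma_x\sigma_u}{\sigma_x^2}
   = \text{Cor}[x,u] \cdot \frac{\sigma_u}{\sigma_x}, \\
\frac{\text{plim}\,\hat{\beta}_{IV} - \beta}{\text{plim}\,\hat{\beta}_{OLS} - \beta}
  &= \frac{\text{Cor}[z,u]}{\text{Cor}[x,u]} \times \frac{1}{\text{Cor}[z,x]}.
\end{align*}
```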


Thus with an invalid instrument and low correlation between the instrument and the 
regressor, the IV estimator can be even more inconsistent than OLS. For example, 
suppose the correlation between z and x is 0.1, which is not unusual for cross-section 
data. Then IV becomes more inconsistent than OLS as soon as the correlation
coefficient between $z$ and $u$ exceeds a mere 0.1 times the correlation coefficient between $x$
and $u$.
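
The threshold in this example can be checked numerically. The short sketch below evaluates the right-hand side of (4.59); the function name and the value 0.5 for Cor[x, u] are hypothetical choices made only for illustration.

```python
def relative_inconsistency(cor_zu, cor_xu, cor_zx):
    """Ratio (plim b_IV - beta) / (plim b_OLS - beta) implied by (4.59)."""
    return (cor_zu / cor_xu) * (1.0 / cor_zx)

cor_zx = 0.1   # instrument-regressor correlation from the example in the text
cor_xu = 0.5   # hypothetical degree of endogeneity of the regressor

for cor_zu in (0.01, 0.05, 0.10):
    ratio = relative_inconsistency(cor_zu, cor_xu, cor_zx)
    print(f"Cor[z,u] = {cor_zu:.2f}: IV inconsistency is {ratio:.1f} times that of OLS")
```

With these assumed values the break-even point is Cor[z, u] = 0.1 × 0.5 = 0.05, beyond which the ratio exceeds one.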

Result (4.59) can be extended to the model (4.56) with one endogenous regressor
and several exogenous regressors, iid errors, and instruments that include all the
exogenous regressors. Then

$$\frac{\text{plim}\,\hat{\beta}_{1,2SLS} - \beta_1}{\text{plim}\,\hat{\beta}_{1,OLS} - \beta_1} = \frac{\text{Cor}[\hat{x}, u]}{\text{Cor}[x, u]} \times \frac{1}{R_p^2}, \tag{4.60}$$

where $R_p^2$ is defined after (4.56). For extension to more than one endogenous regressor
see Shea (1997).

These results, emphasized by Bound et al. (1995), have profound implications for 
the use of IV. If instruments are weak then even mild instrument endogeneity can lead 
to IV being even more inconsistent than OLS. Perhaps because the conclusion is so 
negative, the literature has neglected this aspect of weak instruments. A notable recent 
exception is Hahn and Hausman (2003a). 

Most of the literature assumes that condition 1 is satisfied, so that IV is consistent, 
and focuses on other complications attributable to weak instruments. 


4.9.3. Low Precision 


Although IV estimation can lead to consistent estimation when OLS is inconsistent, it 
also leads to a loss in precision. Intuitively, from Section 4.8.2 the instrument z is a 
treatment that leads to exogenous movement in x but does so with considerable noise. 

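The loss in precision increases, and standard errors increase, with weaker
instruments. This is easily seen in the simplest case of a single endogenous regressor and
single instrument with iid errors. Then the asymptotic variance is

$$\begin{aligned}
\text{V}[\hat{\beta}_{IV}] &= \sigma^2 (\mathbf{x}'\mathbf{z})^{-1}\mathbf{z}'\mathbf{z}(\mathbf{z}'\mathbf{x})^{-1} \\
&= [\sigma^2/\mathbf{x}'\mathbf{x}] \,/\, [(\mathbf{z}'\mathbf{x})^2/(\mathbf{z}'\mathbf{z})(\mathbf{x}'\mathbf{x})] \\
&= \text{V}[\hat{\beta}_{OLS}]/r_{xz}^2.
\end{aligned} \tag{4.61}$$

For example, if the sample correlation coefficient between $z$ and $x$ equals 0.1, so that
$r_{xz}^2 = 0.01$, then the IV variance is 100 times, and IV standard errors are 10 times, those of OLS.
Moreover, the IV estimator has larger variance than the OLS estimator unless $|\text{Cor}[z, x]| = 1$.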
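
The identity in (4.61) holds exactly in any sample once the data are demeaned, which the following sketch verifies on simulated data. The design (sample size, first-stage coefficient of 0.1, unit error variance) is an assumption made here for illustration.

```python
import math
import random

rng = random.Random(1)  # fixed seed for reproducibility
n = 10000
z = [rng.gauss(0, 1) for _ in range(n)]
x = [0.1 * zi + rng.gauss(0, 1) for zi in z]   # population Cor[z, x] is about 0.1

def demean(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]

z, x = demean(z), demean(x)
zz = sum(a * a for a in z)
xx = sum(a * a for a in x)
zx = sum(a * b for a, b in zip(z, x))

sigma2 = 1.0                      # error variance; it cancels in the ratio
var_iv = sigma2 * zz / zx ** 2    # sigma^2 (x'z)^{-1} z'z (z'x)^{-1}
var_ols = sigma2 / xx             # sigma^2 (x'x)^{-1}
r2_xz = zx ** 2 / (zz * xx)       # squared sample correlation between z and x
se_ratio = math.sqrt(var_iv / var_ols)

print(f"r_xz^2 = {r2_xz:.4f}, variance ratio = {var_iv / var_ols:.1f}, se ratio = {se_ratio:.1f}")
```

The variance ratio equals $1/r_{xz}^2$ and the standard-error ratio equals $1/|r_{xz}|$, whatever the particular draw.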

Result (4.61) can be extended to the model (4.56) with one endogenous regressor
and several exogenous regressors, iid errors, and instruments that include all the
exogenous regressors. Then

$$\text{se}[\hat{\beta}_{1,2SLS}] = \text{se}[\hat{\beta}_{1,OLS}]/R_p, \tag{4.62}$$

where se[·] denotes asymptotic standard error and $R_p^2$ is defined after (4.56). For
extension to more than one endogenous regressor this $R_p^2$ is replaced by the $R_p^{*2}$ proposed
by Shea (1997). This provided the motivation for Shea's test statistic.

The poor precision is concentrated on the coefficients for endogenous variables. For 
exogenous variables the standard errors for 2SLS coefficient estimates are similar to 
those for OLS. Intuitively, exogenous variables are being instrumented by themselves,
so they have a very strong instrument. 

For the coefficients of an endogenous regressor it is a low partial $R^2$, rather than $R^2$,
that leads to a loss of estimator precision. This explains why 2SLS standard errors can 
be much higher than OLS standard errors despite the high raw correlation between the 
endogenous variable and the instruments. Going the other way, 2SLS standard errors 
for coefficients of endogenous variables that are much larger than OLS standard errors 
provide a clear signal that instruments are weak. 

Statistics used to detect low precision of IV caused by weak instruments are called 
measures of instrument relevance. To some extent they are unnecessary as the prob- 
lem is easily detected if IV standard errors are much larger than OLS standard errors. 


4.9.4. Finite-Sample Bias 


This section summarizes a relatively challenging and as yet unfinished literature on 
“weak instruments” that focuses on the practical problem that even in “large” samples 
asymptotic theory can provide a poor approximation to the distribution of the IV esti- 
mator. In particular the IV estimator is biased in finite samples even if asymptotically 
consistent. The bias can be especially pronounced when instruments are weak. 

This bias of IV, which is toward the inconsistent OLS estimator, can be remark- 
ably large, as demonstrated in a simple Monte Carlo experiment by Nelson and Startz 
(1990), and by a real data application involving several hundred thousand observations 
but very weak instruments by Bound et al. (1995). Moreover, the standard errors can 
also be very biased, as also demonstrated by Nelson and Startz (1990). 
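
The direction of this bias can be seen in a small simulation. The sketch below uses a hypothetical just-identified design chosen here for illustration (it is not the design of Nelson and Startz, 1990): one instrument with a first-stage coefficient of 0.1, an error correlated with the first-stage error, and 100 observations per replication. Because the mean of the just-identified IV estimator need not exist, the comparison uses medians across replications.

```python
import random
import statistics

def simulate_once(rng, n, pi, rho, beta):
    """One replication: returns (OLS estimate, IV estimate) for y = beta*x + u."""
    szx = szy = sxx = sxy = 0.0
    for _ in range(n):
        z = rng.gauss(0, 1)                      # instrument
        v = rng.gauss(0, 1)                      # first-stage error
        e = rng.gauss(0, 1)
        u = rho * v + (1 - rho ** 2) ** 0.5 * e  # structural error, Cor[u, v] = rho
        x = pi * z + v                           # weak first stage when pi is small
        y = beta * x + u
        szx += z * x
        szy += z * y
        sxx += x * x
        sxy += x * y
    # No intercept: all variables have population mean zero in this design.
    return sxy / sxx, szy / szx

rng = random.Random(12345)  # fixed seed for reproducibility
beta, rho, pi, n, reps = 1.0, 0.8, 0.1, 100, 2000
draws = [simulate_once(rng, n, pi, rho, beta) for _ in range(reps)]
ols_median = statistics.median(d[0] for d in draws)
iv_median = statistics.median(d[1] for d in draws)

print(f"true beta = {beta}, median OLS = {ols_median:.2f}, median IV = {iv_median:.2f}")
```

In this assumed design the median IV estimate lies between the true coefficient and the OLS estimate, so the IV estimator is pulled toward OLS even though the instrument is valid.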

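The theoretical literature entails quite specialized and advanced econometric theory,
as it is actually difficult to obtain the finite-sample mean of the IV estimator. To see this,
consider adapting to the IV estimator the usual proof of unbiasedness of the OLS
estimator given in Section 4.4.8. For $\hat{\boldsymbol{\beta}}_{IV}$ defined in (4.51) for the just-identified case
this yields

$$\begin{aligned}
\text{E}[\hat{\boldsymbol{\beta}}_{IV}] &= \boldsymbol{\beta} + \text{E}_{\mathbf{Z},\mathbf{X},\mathbf{u}}[(\mathbf{Z}'\mathbf{X})^{-1}\mathbf{Z}'\mathbf{u}] \\
&= \boldsymbol{\beta} + \text{E}_{\mathbf{Z},\mathbf{X}}[(\mathbf{Z}'\mathbf{X})^{-1}\mathbf{Z}' \times \text{E}[\mathbf{u}|\mathbf{Z},\mathbf{X}]],
\end{aligned}$$

where the unconditional expectation with respect to all stochastic variables, $\mathbf{Z}$, $\mathbf{X}$,
and $\mathbf{u}$, is obtained by first taking expectation with respect to $\mathbf{u}$ conditional on $\mathbf{Z}$
and $\mathbf{X}$, using the law of iterated expectations (see Section A.8). An obvious
sufficient condition for the IV estimator to have mean $\boldsymbol{\beta}$ is that $\text{E}[\mathbf{u}|\mathbf{Z},\mathbf{X}] = \mathbf{0}$. This
assumption is too strong, however, because it implies $\text{E}[\mathbf{u}|\mathbf{X}] = \mathbf{0}$, in which case
there would be no need to instrument in the first place. So there is no simple way
to obtain $\text{E}[\hat{\boldsymbol{\beta}}_{IV}]$. A similar problem does not arise in establishing consistency. Then
$\hat{\boldsymbol{\beta}}_{IV} = \boldsymbol{\beta} + (N^{-1}\mathbf{Z}'\mathbf{X})^{-1}N^{-1}\mathbf{Z}'\mathbf{u}$, where the term $N^{-1}\mathbf{Z}'\mathbf{u}$ can be considered in
isolation of $\mathbf{X}$ and the assumption $\text{E}[\mathbf{u}|\mathbf{Z}] = \mathbf{0}$ leads to plim $N^{-1}\mathbf{Z}'\mathbf{u} = \mathbf{0}$.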

Therefore we need to use alternative methods to obtain the mean of the IV estimator. 
Here we merely summarize key results. 

Initial research made the strong assumption of joint normality of variables and
homoskedastic errors. Then the IV estimator has a Wishart distribution (defined in
Chapter 13). Surprisingly, the mean of the IV estimator does not even exist in the
just-identified case, a signal that there may be finite-sample problems. The mean does exist
if there is at least one overidentifying restriction, and the variance exists if there are at 
least two overidentifying restrictions. Even when the mean exists the IV estimator is 
biased, with bias in the direction of OLS. With more overidentifying restrictions the 
bias increases, eventually equaling the bias of the OLS estimator. A detailed discussion 
is given in Davidson and MacKinnon (1993, pp. 221-224). Approximations based on 
power-series expansions have also been used. 

What determines the size of the finite-sample bias? For regression with a single
regressor $x$ that is endogenous and is related to the instruments $\mathbf{z}$ by the reduced form
model $x = \mathbf{z}'\boldsymbol{\pi} + v$, the concentration parameter $\tau^2$ is defined as $\tau^2 = \boldsymbol{\pi}'\mathbf{Z}'\mathbf{Z}\boldsymbol{\pi}/\sigma_v^2$.
The bias of IV can be shown to be a decreasing function of $\tau^2$. The quantity $\tau^2/K$,
where $K$ is the number of instruments, is the population analogue of the $F$-statistic
for a test of whether $\boldsymbol{\pi} = \mathbf{0}$. The statistic $F - 1$, where $F$ is the actual $F$-statistic in
the first-stage reduced form model, can be shown to be an approximately unbiased
estimate of $\tau^2/K$. This leads to tests for finite-sample bias being based on the
$F$-statistic given in Section 4.9.1.

Staiger and Stock (1997) obtained results under weaker distributional assumptions. 
In particular, normality is no longer needed. Their approach uses weak instrument 
asymptotics that find the limit distribution of IV estimators for a sequence of models 
with $\tau^2/K$ held constant as $N \to \infty$. In a simple model $1/F$ provides an approximate
estimate of the finite-sample bias of the IV estimator relative to OLS. More generally, 
the extent of the bias for given F varies with the number of endogenous regressors and 
the number of instruments. Simulations show that to ensure that the maximal bias in 
IV is no more than 10% that of OLS we need F > 10. This threshold is widely cited 
but falls to around 6.5, for example, if one is comfortable with bias in IV of 20% of 
that for OLS. So a less strict rule of thumb is F > 5. Shea (1997) demonstrated that 
low partial $R^2$ is also associated with finite-sample bias but there is no similar rule of
thumb for use of partial $R^2$ as a diagnostic for finite-sample bias.
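
The two back-of-the-envelope quantities mentioned here, $1/F$ as a rough relative bias and $F - 1$ as an estimate of $\tau^2/K$, are simple to tabulate. The sketch below does so for a few first-stage $F$-statistics; the function names are hypothetical, and $1/F$ is only the simple-model approximation, not the simulation-based thresholds quoted in the text.

```python
def approx_relative_bias(f_stat):
    """Rough bias of IV relative to OLS, using the simple-model approximation 1/F."""
    return 1.0 / f_stat

def concentration_per_instrument(f_stat):
    """F - 1, an approximately unbiased estimate of tau^2 / K."""
    return f_stat - 1.0

for f_stat in (5.0, 8.07, 10.0, 50.0):
    bias = approx_relative_bias(f_stat)
    tau2_k = concentration_per_instrument(f_stat)
    print(f"F = {f_stat:6.2f}: relative bias ~ {100 * bias:4.1f}%, tau^2/K ~ {tau2_k:.2f}")
```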

For models with more than one endogenous regressor, separate F-statistics can be 
computed for each endogenous regressor. For a joint statistic Stock, Wright and Yogo 
(2002) propose using the minimum eigenvalue of a matrix analogue of the first-stage 
test F-statistic. Stock and Yogo (2003) present relevant critical values for this eigen- 
value as the desired degree of bias, the number of endogenous variables, and the num- 
ber of overidentifying restrictions vary. These tables include the single endogenous 
regressor as a special case and presume at least two overidentifying restrictions, so 
they do not apply to just-identified models. 

Finite-sample bias problems arise not only for the IV estimate but also for IV stan- 
dard errors and test statistics. Stock et al. (2002) present a similar approach to Wald 
tests whereby a test of $\beta = \beta_0$ at a nominal level of 5% is to have actual size of, say,
no more than 15%. Stock and Yogo (2003) also present detailed tables taking this size 
distortion approach that include just-identified models. 


4.9.5. Responses to Weak Instruments


What can the practitioner do in the face of weak instruments? 

As already noted one approach is to limit the number of instruments used. This can 
be done by dropping instruments or by combining instruments. 

If finite-sample bias is a concern then alternative estimators may have better small- 
sample properties than 2SLS. A number of alternatives, many variants of IV, are pre- 
sented in Section 6.4.4. 

Despite the emphasis on finite-sample bias the other problems created by weak 
instruments may be of greater importance in applications. It is possible with a large 
enough sample for the first-stage reduced form F-statistic to be large enough that 
finite-sample bias is not a problem. Meanwhile, the partial $R^2$ may be very small,
leading to fragility to even slight correlation between the model error and instrument. 
This is difficult to test for and to overcome. 

There also can be great loss in estimator precision, as detailed in Sections 4.9.3 
and 4.9.4. In such cases either larger samples are needed or alternative approaches to 
estimating causal marginal effects must be used. These methods are summarized in 
Section 2.8 and presented elsewhere in this book. 


4.9.6. IV Application 


Kling (2001) analyzed in detail the use of college proximity as an instrument for 
schooling. Here we use the same data from the NLS young men’s cohort on 3,010 
males aged 24 to 34 years old in 1976 as used to produce Table 1 of Kling (2001) and 
originally used by Card (1995). The model estimated is 


$$\ln w_i = \alpha + \beta_1 s_i + \beta_2 e_i + \beta_3 e_i^2 + \mathbf{x}_{2i}'\boldsymbol{\gamma} + u_i,$$


where $s$ denotes years of schooling, $e$ denotes years of work experience, $e^2$ denotes
experience squared, and $\mathbf{x}_2$ is a vector of 26 control variables that are mainly geographic
indicators and measures of parental education.

The schooling variable is considered endogenous, owing to lack of data on ability. 
Additionally, the two work experience variables are endogenous, since work experi- 
ence is calculated as age minus years of schooling minus six, as is common in this 
literature, and schooling is endogenous. At least three instruments are needed. 

Here exactly three instruments are used, so the model is just-identified. The first 
instrument is col4, an indicator for whether a four-year college is nearby. This instru- 
ment has already been discussed in Section 4.8.2. The other two instruments are age 
and age squared. These are highly correlated with experience and experience squared, 
yet it is believed they can be omitted from the model for log-wage since it is work 
experience that matters. The remaining regressor vector $\mathbf{x}_2$ is used as an instrument for
itself. 

Although age is clearly exogenous, some unobservables such as social skills may be 
correlated with both age and wage. Then the use of age and age squared as instruments 
can be questioned. This illustrates the general point that there can be disagreement on 
assumptions of instrument validity. 

Table 4.5. Returns to Schooling: Instrumental Variables Estimates^a

                                     OLS         IV
  Schooling (s)                     0.073       0.132
                                   (0.004)     (0.049)
  $R^2$                             0.304       0.207
  Shea's partial $R^2$                -         0.006
  First-stage F-statistic for s       -         8.07

^a Sample of 3,010 young males. Dependent variable is log hourly
wage. Coefficient and standard error for schooling given; estimates
for experience, experience squared, 26 control variables,
and an intercept are not reported. For the three endogenous
regressors, schooling (s), experience (e), and experience squared
($e^2$), the three instruments are an indicator for whether a
four-year college (col4) is nearby, age, and age squared. The partial
$R^2$ and first-stage F-statistic are weak instruments diagnostics
explained in the text.


Results are given in Table 4.5. The OLS estimate of $\beta_1$ is 0.073, so that wages
rise by 7.6% ($= 100 \times (e^{0.073} - 1)$) on average with each extra year of schooling. This
estimate is an inconsistent estimate of $\beta_1$ given omitted ability. The IV estimate, or
equivalently the 2SLS estimate since the model is just-identified, is 0.132. An extra
year of schooling is estimated to lead to a 14.1% ($= 100 \times (e^{0.132} - 1)$) increase in
wage.
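
The conversion from a log-wage coefficient to a percentage change can be reproduced directly. The helper name below is hypothetical; the coefficients are the rounded values reported in Table 4.5.

```python
import math

def pct_effect(log_coef):
    """Percentage change in the wage implied by a coefficient in a log-wage equation."""
    return 100.0 * (math.exp(log_coef) - 1.0)

ols_pct = pct_effect(0.073)   # OLS schooling coefficient from Table 4.5
iv_pct = pct_effect(0.132)    # IV schooling coefficient from Table 4.5
print(f"OLS: {ols_pct:.1f}% per year of schooling, IV: {iv_pct:.1f}% per year of schooling")
```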

The IV estimator is much less efficient than OLS. A formal test does not reject ho- 
moskedasticity and we follow Kling (2001) and use the usual standard errors, which 
are very close to the heteroskedastic-robust standard errors. The standard error of 
$\hat{\beta}_{1,OLS}$ is 0.004 whereas that for $\hat{\beta}_{1,IV}$ is 0.049, over 10 times larger. The standard
errors for the other two endogenous regressors are about 4 times larger and the stan- 
dard errors for the exogenous regressors are about 1.2 times larger. The R? falls from 
0.304 to 0.207. 

$R^2$ measures confirm that the instruments are not very relevant for schooling. A
simple test is to note that the regression (4.58) of schooling on all of the instruments
yields $R^2 = 0.297$, which only falls a little to $R^2 = 0.291$ if the three additional
instruments are dropped. More formally, Shea's partial $R^2$ here equals $0.0064 = 0.08^2$,
which from (4.62) predicts that the standard error of $\hat{\beta}_{1,IV}$ will be inflated by a multiple
$12.5 = 1/0.08$, very close to the inflation observed here. This reduces the $t$-statistic on
schooling from 19.64 to 2.68. In many applications such a reduction would lead to
statistical insignificance. In addition, from Section 4.9.2 even slight correlation between
the instrument $\text{col4}_i$ and the error term $u_i$ will lead to inconsistency of IV.
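
The comparison between the predicted and observed inflation of the standard error can be checked with the rounded figures reported above. The variable names are hypothetical; because the inputs are rounded, ratios computed this way can differ slightly from statistics computed from unrounded estimates.

```python
import math

shea_partial_r2 = 0.0064       # Shea's partial R^2 for schooling, from the text
se_ols, se_iv = 0.004, 0.049   # standard errors from Table 4.5

predicted_inflation = 1.0 / math.sqrt(shea_partial_r2)   # implied by (4.62)
observed_inflation = se_iv / se_ols

print(f"predicted inflation = {predicted_inflation:.2f}, observed = {observed_inflation:.2f}")
```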

To see whether finite-sample bias may also be a problem we run the regression 
(4.58) of schooling on all of the instruments. Testing the joint significance of the three 
additional instruments yields an F-statistic of 8.07, suggesting that the bias of IV may 
be 10 or 20% that of OLS. A similar regression for the other two endogenous variables 
yields much higher F-statistics since, for example, age is a good additional instrument 
for experience. Given that there are three endogenous regressors it is actually
better to use the method of Stock et al. (2002) discussed in Section 4.9.4, though here the
problem is restricted to schooling since for experience and experience squared,
respectively, Shea's partial $R^2$ equals 0.0876 and 0.0138, whereas the first-stage $F$-statistics
are 1,772 and 1,542.

If additional instruments are available then the model becomes overidentified and 
standard procedure is to additionally perform a test of overidentifying restrictions (see 
Section 8.4.4). 


4.10. Practical Considerations 


The estimation procedures in this chapter are implemented in all standard economet- 
rics packages for cross-section data, except that not all packages implement quantile 
regression. Most provide robust standard errors as an option rather than the default. 

The most difficult estimator to apply can be the instrumental variables estimator, as 
in many potential applications it can be difficult to obtain instruments that are uncor- 
related with the error yet reasonably correlated with the regressor or regressors being 
instrumented. Such instruments can be obtained through specification of a complete 
structural model, such as a simultaneous equations system. Current applied research 
emphasizes alternative approaches such as natural experiments. 


4.11. Bibliographic Notes 


The results in this chapter are presented in many first-year graduate texts, such as those by 
Davidson and MacKinnon (2004), Greene (2003), Hayashi (2000), Johnston and DiNardo
(1997), Mittelhammer, Judge, and Miller (2000), and Ruud (2000). We have emphasized re- 
gression with stochastic regressors, robust standard errors, quantile regression, endogeneity, 
and instrumental variables. 


4.2 Manski (1991) has a nice discussion of regression in a general setting that includes discus- 
sion of the loss functions given in Section 4.2. 

4.3 The returns to schooling example is well studied. Angrist and Krueger (1999) and Card 
(1999) provide recent surveys. 

4.4 For a history of least squares see Stigler (1986). The method was introduced by Legendre 
in 1805. Gauss in 1810 applied least squares to the linear model with normally distributed 
error and proposed the elimination method for computation, and in later work he proposed 
the theorem now called the Gauss-Markov theorem. Galton introduced the concept of
regression, meaning mean-reversion in the context of inheritance of family traits, in 1887.
For an early “modern” treatment with application to pauperism and welfare availability see 
Yule (1897). Statistical inference based on least-squares estimates of the linear regression 
model was developed most notably by Fisher. The heteroskedastic-consistent estimate of 
the variance matrix of the OLS estimator, due to White (1980a) building on earlier work 
by Eicker (1963), has had a profound impact on statistical inference in microeconometrics 
and has been extended to many settings. 

4.6 Boscovich in 1757 proposed a least absolute deviations estimator that predates least 
squares; see Stigler (1986). A review of quantile regression, introduced by Koenker and 
Bassett (1978), is given in Buchinsky (1994). A more elementary exposition is given in
Koenker and Hallock (2001). 

The earliest known use of instrumental variables estimation to secure identification in a 
simultaneous equations setting was by Wright (1928). Another oft-cited early reference is 
Reiersol (1941), who used instrumental variables methods to control for measurement error 
in the regressors. Sargan (1958) gives a classic early treatment of IV estimation. Stock and 
Trebbi (2003) provide additional early references. 

Instrumental variables estimation is presented in econometrics texts, with emphasis on al- 
gebra but not necessarily intuition. The method is widely used in econometrics because of 
the desirability of obtaining estimates with a causal interpretation. 

The problem of weak instruments was drawn to the attention of applied researchers by 
Nelson and Startz (1990) and Bound et al. (1995). There are a number of theoretical an- 
tecedents, most notably the work of Nagar (1959). The problem has dampened enthusiasm 
for IV estimation, and small-sample bias owing to weak instruments is currently a very 
active research topic. Results often assume iid normal errors and restrict analysis to one 
endogenous regressor. The survey by Stock et al. (2002) provides many references with 
emphasis on weak instrument asymptotics. It also briefly considers extensions to nonlinear 
models. The survey by Hahn and Hausman (2003b) presents additional methods and results 
that we have not reviewed here. For recent work on bias in standard errors see Bond and 
Windmeijer (2002). For a careful application see C.-I. Lee (2001). 


Exercises 


4-1 Consider the linear regression model $y_i = \mathbf{x}_i'\boldsymbol{\beta} + u_i$ with nonstochastic regressors
$\mathbf{x}_i$ and error $u_i$ that has mean zero but is correlated as follows: $\text{E}[u_iu_j] = \sigma^2$ if
$i = j$, $\text{E}[u_iu_j] = \rho\sigma^2$ if $|i - j| = 1$, and $\text{E}[u_iu_j] = 0$ if $|i - j| > 1$. Thus errors for
immediately adjacent observations are correlated whereas errors are otherwise
uncorrelated. In matrix notation we have $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$, where $\boldsymbol{\Omega} = \text{E}[\mathbf{u}\mathbf{u}']$. For this
model answer each of the following questions using results given in Section 4.4.

(a) Verify that $\boldsymbol{\Omega}$ is a band matrix with nonzero terms only on the diagonal and
on the first off-diagonal; and give these nonzero terms.

(b) Obtain the asymptotic distribution of $\hat{\boldsymbol{\beta}}_{OLS}$ using (4.19).

(c) State how to obtain a consistent estimate of $\text{V}[\hat{\boldsymbol{\beta}}_{OLS}]$ that does not depend on
unknown parameters.

(d) Is the usual OLS output estimate $s^2(\mathbf{X}'\mathbf{X})^{-1}$ a consistent estimate of $\text{V}[\hat{\boldsymbol{\beta}}_{OLS}]$?

(e) Is White's heteroskedasticity robust estimate of $\text{V}[\hat{\boldsymbol{\beta}}_{OLS}]$ consistent here?


4-2 Suppose we estimate the model $y_i = \mu + u_i$, where $u_i \sim \mathcal{N}[0, \sigma_i^2]$.


(a) Show that the OLS estimator of $\mu$ simplifies to $\hat{\mu} = \bar{y}$.
(b) Hence directly obtain the variance of $\bar{y}$. Show that this equals White's
heteroskedastic consistent estimate of the variance given in (4.21).


4-3 Suppose the dgp is $y_i = \beta_0 x_i + u_i$, $u_i = x_i\varepsilon_i$, $x_i \sim \mathcal{N}[0, 1]$, and $\varepsilon_i \sim \mathcal{N}[0, 1]$.
Assume that data are independent over $i$ and that $x_i$ is independent of $\varepsilon_i$. Note that
the first four central moments of $\mathcal{N}[0, \sigma^2]$ are 0, $\sigma^2$, 0, and $3\sigma^4$.


(a) Show that the error term $u_i$ is conditionally heteroskedastic.
(b) Obtain plim $N^{-1}\mathbf{X}'\mathbf{X}$. [Hint: Obtain $\text{E}[x_i^2]$ and apply a law of large numbers.]
(c) Obtain $\sigma_0^2 = \text{V}[u_i]$, where the expectation is with respect to all stochastic
variables in the model.

(d) Obtain plim $N^{-1}\mathbf{X}'\boldsymbol{\Omega}_0\mathbf{X} = \lim N^{-1}\text{E}[\mathbf{X}'\boldsymbol{\Omega}_0\mathbf{X}]$, where $\boldsymbol{\Omega}_0 = \text{Diag}[\text{V}[u_i|x_i]]$.

(e) Using answers to the preceding parts give the default OLS result (4.22) for the
variance matrix in the limit distribution of $\sqrt{N}(\hat{\beta}_{OLS} - \beta_0)$, ignoring potential
heteroskedasticity. Your ultimate answer should be numerical.

(f) Now give the variance in the limit distribution of $\sqrt{N}(\hat{\beta}_{OLS} - \beta_0)$, taking
account of any heteroskedasticity. Your ultimate answer should be numerical.

(g) Do any differences between answers to parts (e) and (f) accord with your
prior beliefs?


4-4 Consider the linear regression model with scalar regressor $y_i = \beta x_i + u_i$ with data
$(y_i, x_i)$ iid over $i$ though the error may be conditionally heteroskedastic.


(a) Show that $(\hat{\beta}_{OLS} - \beta) = (N^{-1}\sum_i x_i^2)^{-1} N^{-1}\sum_i x_iu_i$.

(b) Apply Kolmogorov law of large numbers (Theorem A.8) to the averages of $x_i^2$
and $x_iu_i$ to show that $\hat{\beta}_{OLS} \xrightarrow{p} \beta$. State any additional assumptions made on
the dgp for $x_i$ and $u_i$.

(c) Apply the Lindeberg-Levy central limit theorem (Theorem A.14) to the
averages of $x_iu_i$ to show that $N^{-1}\sum_i x_iu_i \big/ \sqrt{N^{-2}\sum_i \text{E}[u_i^2x_i^2]} \xrightarrow{d} \mathcal{N}[0, 1]$. State any
additional assumptions made on the dgp for $x_i$ and $u_i$.

(d) Use the product limit normal rule (Theorem A.17) to show that part (c) implies
$N^{-1/2}\sum_i x_iu_i \xrightarrow{d} \mathcal{N}[0, \lim N^{-1}\sum_i \text{E}[u_i^2x_i^2]]$. State any assumptions made on
the dgp for $x_i$ and $u_i$.

(e) Combine results using (2.14) and the product limit normal rule (Theorem
A.17) to obtain the limit distribution of $\hat{\beta}_{OLS}$.

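4-5 Consider the linear regression model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$.

(a) Obtain the formula for $\hat{\boldsymbol{\beta}}$ that minimizes $Q(\boldsymbol{\beta}) = \mathbf{u}'\mathbf{W}\mathbf{u}$, where $\mathbf{W}$ is of full rank.
[Hint: The chain rule for matrix differentiation for column vectors $\mathbf{x}$ and $\mathbf{z}$ is
$\partial f(\mathbf{x})/\partial\mathbf{x} = (\partial\mathbf{z}'/\partial\mathbf{x}) \times (\partial f(\mathbf{z})/\partial\mathbf{z})$, for $f(\mathbf{x}) = f(g(\mathbf{x})) = f(\mathbf{z})$ where $\mathbf{z} = g(\mathbf{x})$.]

(b) Show that this simplifies to the OLS estimator if $\mathbf{W} = \mathbf{I}$.

(c) Show that this gives the GLS estimator if $\mathbf{W} = \boldsymbol{\Omega}^{-1}$.

(d) Show that this gives the 2SLS estimator if $\mathbf{W} = \mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'$.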


4-6 Consider IV estimation (Section 4.8) of the model $y = \mathbf{x}'\boldsymbol{\beta} + u$ using instruments
$\mathbf{z}$ in the just-identified case with $\mathbf{Z}$ an $N \times K$ matrix of full rank.

(a) What essential assumptions must $\mathbf{z}$ satisfy for the IV estimator to be
consistent for $\boldsymbol{\beta}$? Explain.

(b) Show that given just identification the 2SLS estimator defined in (4.53) re- 
duces to the IV estimator given in (4.51). 

(c) Give a real-world example of a situation where IV estimation is needed be- 
cause of inconsistency of OLS, and specify suitable instruments. 



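4-7 (Adapted from Nelson and Startz, 1990.) Consider the three-equation model, $y = \beta x + u$;
$x = \lambda u + \varepsilon$; $z = \gamma\varepsilon + v$, where the mutually independent errors $u$, $\varepsilon$, and
$v$ are iid normal with mean 0 and variances, respectively, $\sigma_u^2$, $\sigma_\varepsilon^2$, and $\sigma_v^2$.
(a) Show that plim$(\hat{\beta}_{OLS} - \beta) = \lambda\sigma_u^2/(\lambda^2\sigma_u^2 + \sigma_\varepsilon^2)$.
(b) Show that $\rho_{xz}^2 = \gamma^2\sigma_\varepsilon^4/[(\lambda^2\sigma_u^2 + \sigma_\varepsilon^2)(\gamma^2\sigma_\varepsilon^2 + \sigma_v^2)]$.
(c) Show that $\hat{\beta}_{IV} = m_{zy}/m_{zx} = \beta + m_{zu}/(\lambda m_{zu} + m_{z\varepsilon})$, where, for example,
$m_{zy} = \sum_i z_iy_i$.
(d) Show that $\hat{\beta}_{IV} - \beta \to 1/\lambda$ as $\gamma$ (or $\rho_{xz}$) $\to 0$.
(e) Show that $\hat{\beta}_{IV} \to \infty$ as $m_{zu} \to -\gamma\sigma_\varepsilon^2/\lambda$.
(f) What do the last two results imply regarding finite-sample biases and the
moments of $\hat{\beta}_{IV} - \beta$ when the instruments are poor?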
4-8 Select a 50% random subsample of the Section 4.6.4 data on log health expen- 
diture (y) and log total expenditure (x). 
(a) Obtain OLS estimates and contrast usual and White standard errors for the 
slope coefficient. 
(b) Obtain median regression estimates and compare these to the OLS esti- 
mates. 
(c) Obtain quantile regression estimates for q = 0.25 and q = 0.75. 
(d) Reproduce Figure 4.2 using your answers from parts (a)—(c). 
4-9 Select a 50% random subsample of the Section 4.9.6 data on earnings and edu- 
cation, and reproduce as much of Table 4.5 as possible and provide appropriate 
interpretation. 
Maximum Likelihood and
Nonlinear Least-Squares
Estimation


5.1. Introduction 


A nonlinear estimator is one that is a nonlinear function of the dependent variable. 
Most estimators used in microeconometrics, aside from the OLS and IV estimators in 
the linear regression model presented in Chapter 4, are nonlinear estimators. Nonlin- 
earity can arise in many ways. The conditional mean may be nonlinear in parameters. 
The loss function may lead to a nonlinear estimator even if the conditional mean is 
linear in parameters. Censoring and truncation also lead to nonlinear estimators even 
if the original model has conditional mean that is linear in parameters. 

Here we present the essential statistical inference results for nonlinear estimation. 
Very limited small-sample results are available for nonlinear estimators. Statistical in- 
ference is instead based on asymptotic theory that is applicable for large samples. The 
estimators commonly used in microeconometrics are consistent and asymptotically 
normal. 

The asymptotic theory entails two major departures from the treatment of the linear 
regression model given in an introductory graduate course. First, alternative methods 
of proof are needed since there is no direct formula for most nonlinear estimators. 
Second, the asymptotic distribution is generally obtained under the weakest distri- 
butional assumptions possible. This departure was introduced in Section 4.4 to permit 
heteroskedasticity-robust inference for the OLS estimator. Under such weaker assump- 
tions the default standard errors reported by a simple regression program are invalid. 
Some care is needed, however, as these weaker assumptions can lead to inconsistency 
of the estimator itself, a much more fundamental problem. 

As much as possible the presentation here is expository. Definitions of conver- 
gence in probability and distribution, laws of large numbers (LLN), and central limit 
theorems (CLT) are presented in many texts, and here these topics are relegated to 
Appendix A. Applied researchers rarely aim to formally prove consistency and asymp- 
totic normality. It is not unusual, however, to encounter data applications with estima- 
tion problems sufficiently recent or complex as to demand reading recent econometric 
journal articles. Then familiarity with proofs of consistency and asymptotic normality 
is very helpful, especially to obtain a good idea in advance of the likely form of the
variance matrix of the estimator.

Section 5.2 provides an overview of key results. A more formal treatment of 
extremum estimators that maximize or minimize any objective function is given in Sec- 
tion 5.3. Estimators based on estimating equations are defined and presented in Sec- 
tion 5.4. Statistical inference based on robust standard errors is presented briefly in 
Section 5.5, with complete treatment deferred to Chapter 7. Maximum likelihood es- 
timation and quasi-maximum likelihood estimation are presented in Sections 5.6 and 
5.7. Nonlinear least-squares estimation is given in Section 5.8. Section 5.9 presents a 
detailed example. 

The remaining leading parametric estimation procedures — generalized method 
of moments and nonlinear instrumental variables — are given separate treatment in 
Chapter 6. 


5.2. Overview of Nonlinear Estimators 


This section provides a summary of asymptotic properties of nonlinear estimators, 
given more rigorously in Section 5.3, and presents ways to interpret regression co- 
efficients in nonlinear models. The material is essential for understanding use of the 
cross-section and panel data models presented in later chapters. 


5.2.1. Poisson Regression Example 


It is helpful to introduce a specific example of nonlinear estimation. Here we consider 
Poisson regression, analyzed in more detail in Chapter 20. 

The Poisson distribution is appropriate for a dependent variable y that takes only 
nonnegative integer values 0, 1, 2, .... It can be used to model the number of occur- 
rences of an event, such as number of patent applications by a firm and number of 
doctor visits by an individual. 

The Poisson density, or more formally the Poisson probability mass function, with
rate parameter $\lambda$ is


$f(y|\lambda) = e^{-\lambda}\lambda^{y}/y!, \quad y = 0, 1, 2, \ldots,$


where it can be shown that $E[y] = \lambda$ and $V[y] = \lambda$.

A regression model specifies the parameter $\lambda$ to vary across individuals according
to a specific function of regressor vector $x$ and parameter vector $\beta$. The usual Poisson
specification is


$\lambda = \exp(x'\beta),$


which has the advantage of ensuring that the mean $\lambda > 0$. The density of the Poisson
regression model for a single observation is therefore


$f(y|x, \beta) = e^{-\exp(x'\beta)} \exp(x'\beta)^{y}/y!.$ (5.1)
Consider maximum likelihood estimation based on the sample $\{(y_i, x_i), i = 1, \ldots, N\}$.
The maximum likelihood (ML) estimator maximizes the log-likelihood
function (see Section 5.6). The likelihood function is the joint density, which given
independent observations is the product $\prod_i f(y_i|x_i, \beta)$ of the individual densities,
where we have conditioned on the regressors. The log-likelihood function is then the
log of a product, which equals the sum of logs, or $\sum_i \ln f(y_i|x_i, \beta)$.

For the Poisson density (5.1), the log-density for the ith observation is


$\ln f(y_i|x_i, \beta) = -\exp(x_i'\beta) + y_i x_i'\beta - \ln y_i!.$


So the Poisson ML estimator $\hat{\beta}$ maximizes


$Q_N(\beta) = \frac{1}{N}\sum_{i=1}^{N} \{-\exp(x_i'\beta) + y_i x_i'\beta - \ln y_i!\},$ (5.2)


where the scale factor $1/N$ is included so that $Q_N(\beta)$ remains finite as $N \to \infty$. The
Poisson ML estimator is the solution to the first-order conditions $\partial Q_N(\beta)/\partial\beta|_{\hat{\beta}} = 0$,
or


$\frac{1}{N}\sum_{i=1}^{N} (y_i - \exp(x_i'\beta)) x_i \big|_{\hat{\beta}} = 0.$ (5.3)


There is no explicit solution for $\hat{\beta}$ in (5.3). Numerical methods to compute $\hat{\beta}$ are
given in Chapter 10. In this chapter we instead focus on the statistical properties of the
resulting estimate $\hat{\beta}$.
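
As an illustration only, the first-order conditions (5.3) can be solved iteratively. The Python sketch below is an assumption-laden example rather than the book's own procedure: it simulates data from a Poisson regression with an intercept and one regressor and applies Newton-Raphson steps to (5.3). The sample size, seed, true values (0.5, 1.0), and the helper names `draw_poisson` and `poisson_mle` are all hypothetical choices made for the example.

```python
import math
import random


def draw_poisson(lam, rng):
    """Draw one Poisson(lam) variate (Knuth's multiplication method)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def poisson_mle(y, x, tol=1e-10, max_iter=50):
    """Newton-Raphson for the Poisson ML first-order conditions (5.3).

    Assumed model: E[y_i | x_i] = exp(b0 + b1 * x_i), i.e. regressors (1, x_i).
    """
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # intercept-only starting value
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for yi, xi in zip(y, x):
            mu = math.exp(b0 + b1 * xi)
            g0 += yi - mu            # gradient element for the intercept
            g1 += (yi - mu) * xi     # gradient element for the slope
            h00 += mu                # minus Hessian: sum of mu * x x'
            h01 += mu * xi
            h11 += mu * xi * xi
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det   # (minus Hessian)^{-1} times gradient
        step1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + step0, b1 + step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1


# Hypothetical simulated sample: true beta = (0.5, 1.0), x uniform on (0, 1).
rng = random.Random(12345)
N = 2000
x = [rng.random() for _ in range(N)]
y = [draw_poisson(math.exp(0.5 + 1.0 * xi), rng) for xi in x]

b0_hat, b1_hat = poisson_mle(y, x)
# Sample first-order conditions (5.3) at the estimate; both should be close to zero.
foc0 = sum(yi - math.exp(b0_hat + b1_hat * xi) for yi, xi in zip(y, x)) / N
foc1 = sum((yi - math.exp(b0_hat + b1_hat * xi)) * xi for yi, xi in zip(y, x)) / N
print(b0_hat, b1_hat, foc0, foc1)
```

Because the Poisson log-likelihood is globally concave in the parameters, Newton-Raphson from a reasonable starting value typically converges in a handful of iterations in examples like this one.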


5.2.2. m-Estimators 


More generally, we define an m-estimator $\hat{\theta}$ of the $q \times 1$ parameter vector $\theta$ as an
estimator that maximizes an objective function that is a sum or average of N subfunctions


$Q_N(\theta) = \frac{1}{N}\sum_{i=1}^{N} q(y_i, x_i, \theta),$ (5.4)


where $q(\cdot)$ is a scalar function, $y_i$ is the dependent variable, $x_i$ is a regressor vector,
and the results in this section assume independence over i.

For simplicity $y_i$ is written as a scalar, but the results extend to vector $y_i$ and so
cover multivariate and panel data and systems of equations. The objective function is
subscripted by N to denote that it depends on the sample data. Throughout the book
q is used to denote the dimension of $\theta$. Note that here q is additionally being used to
denote the subfunction $q(\cdot)$ in (5.4).

Many econometrics estimators and models are m-estimators, corresponding to
specific functional forms for $q(y, x, \theta)$. Leading examples are maximum likelihood (see
(5.39) later) and nonlinear least squares (NLS) (see (5.67) later). The Poisson ML
estimator that maximizes (5.2) is an example of (5.4) with $\theta = \beta$ and
$q(y, x, \beta) = -\exp(x'\beta) + y x'\beta - \ln y!$.

We focus attention on the estimator that is computed as the solution to the
associated first-order conditions $\partial Q_N(\theta)/\partial\theta|_{\hat{\theta}} = 0$, or equivalently


$\frac{1}{N}\sum_{i=1}^{N} \frac{\partial q(y_i, x_i, \theta)}{\partial\theta}\Big|_{\hat{\theta}} = 0.$ (5.5)


This is a system of q equations in q unknowns that generally has no explicit solution
for $\hat{\theta}$.

The term m-estimator, attributed to Huber (1967), is interpreted as an abbrevia- 
tion for maximum-likelihood-like estimator. Many econometrics authors, including 
Amemiya (1985, p. 105), Greene (2003, p. 461), and Wooldridge (2002, p. 344), define 
an m-estimator as optimizing over a sum of terms, as in (5.4). Other authors, including 
Serfling (1980), define an m-estimator as solutions of equations such as (5.5). Huber 
(1967) considered both cases and Huber (1981, p. 43) explicitly defined an m-estimator 
in both ways. In this book we call the former type of estimator an m-estimator 
and the latter an estimating equations estimator (which will be treated separately in 
Section 5.4). 


5.2.3. Asymptotic Properties of m-Estimators 


The key desirable asymptotic properties of an estimator are that it be consistent and 
that it have an asymptotic distribution to permit statistical inference at least in large 
samples. 


Consistency 


The first step in determining the properties of $\hat{\theta}$ is to define exactly what $\hat{\theta}$ is intended
to estimate. We suppose that there is a unique value of $\theta$, denoted $\theta_0$ and called the
true parameter value, that generates the data. This identification condition (see
Section 2.5) requires both correct specification of the component of the dgp of interest and
uniqueness of this representation. Thus for the Poisson example it may be assumed that
the dgp is one with Poisson parameter $\exp(x'\beta_0)$ and $x$ is such that $x'\beta^{(1)} = x'\beta^{(2)}$ if
and only if $\beta^{(1)} = \beta^{(2)}$.

The formal notation with subscript 0 for the true parameter value is used extensively
in Chapters 5 to 8. The motivation is that $\theta$ can take many different values, but interest
lies in two particular values: the true value $\theta_0$ and the estimated value $\hat{\theta}$.

The estimate $\hat{\theta}$ will never exactly equal $\theta_0$, even in large samples, because of the
intrinsic randomness of a sample. Instead, we require $\hat{\theta}$ to be consistent for $\theta_0$ (see
Definition A.2 in Appendix A), meaning that $\hat{\theta}$ must converge in probability to $\theta_0$,
denoted $\hat{\theta} \stackrel{p}{\to} \theta_0$.

Rigorously establishing consistency of m-estimators is difficult. Formal results are 
given in Section 5.3.2 and a useful informal condition is given in Section 5.3.7. Spe- 
cializations to ML and NLS estimators are given in later sections. 


Limit Normal Distribution 


Given consistency, as $N \to \infty$ the estimator $\hat{\theta}$ has a distribution with all mass at $\theta_0$. As
for OLS, we magnify or rescale $\hat{\theta}$ by multiplication by $\sqrt{N}$ to obtain a random variable
that has nondegenerate distribution as $N \to \infty$. Statistical inference is then conducted
assuming N is large enough for asymptotic theory to provide a good approximation,
but not so large that $\hat{\theta}$ collapses on $\theta_0$.
We therefore consider the behavior of $\sqrt{N}(\hat{\theta} - \theta_0)$. For most estimators this has a
finite-sample distribution that is too complicated to use for inference. Instead,
asymptotic theory is used to obtain the limit of this distribution as $N \to \infty$. For most
microeconometrics estimators this limit is the multivariate normal distribution. More formally
$\sqrt{N}(\hat{\theta} - \theta_0)$ converges in distribution to the multivariate normal, where convergence
in distribution is defined in Appendix A.

Recall from Section 4.4 that the OLS estimator can be expressed as


$\sqrt{N}(\hat{\beta} - \beta_0) = \left(\frac{1}{N}\sum_{i=1}^{N} x_i x_i'\right)^{-1} \frac{1}{\sqrt{N}}\sum_{i=1}^{N} x_i u_i$


and the limit distribution was derived by obtaining the probability limit of the first term
on the right-hand side and the limit normal distribution of the second term. The limit
distribution of an m-estimator is obtained in a similar way. In Section 5.3.3 we show
that for an estimator that solves (5.5) we can always write


$\sqrt{N}(\hat{\theta} - \theta_0) = -\left(\frac{1}{N}\sum_{i=1}^{N} \frac{\partial^2 q_i(\theta)}{\partial\theta\,\partial\theta'}\Big|_{\theta^+}\right)^{-1} \frac{1}{\sqrt{N}}\sum_{i=1}^{N} \frac{\partial q_i(\theta)}{\partial\theta}\Big|_{\theta_0},$ (5.6)


where $q_i(\theta) = q(y_i, x_i, \theta)$, for some $\theta^+$ between $\hat{\theta}$ and $\theta_0$, provided second derivatives
and the inverse exist. This result is obtained by a Taylor series expansion.
Under appropriate assumptions this yields the following limit distribution of an
m-estimator:


$\sqrt{N}(\hat{\theta} - \theta_0) \stackrel{d}{\to} \mathcal{N}[0, A_0^{-1} B_0 A_0^{-1}],$ (5.7)


where $A_0^{-1}$ is the probability limit of the inverse matrix in the first term on the right-hand
side of (5.6), and the second term is assumed to converge to the $\mathcal{N}[0, B_0]$ distribution.
The expressions for $A_0$ and $B_0$ are given in Table 5.1.
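
The limit result (5.7) can be checked by simulation in a case where the estimator has a closed form. The sketch below is a hedged illustration, not part of the formal argument. It assumes an intercept-only Poisson model ($x_i = 1$), for which (5.3) gives $\hat{\beta} = \ln \bar{y}$. If the data really are Poisson with mean $\lambda_0 = \exp(\beta_0)$, then $\partial q/\partial\beta = y - \exp(\beta)$ and $\partial^2 q/\partial\beta^2 = -\exp(\beta)$, so $A_0 = -\lambda_0$, $B_0 = \lambda_0$, and the limit variance $A_0^{-1}B_0A_0^{-1}$ equals $1/\lambda_0$. The values of N, the number of replications, the seed, and $\lambda_0 = 2$ are arbitrary choices.

```python
import math
import random


def draw_poisson(lam, rng):
    """Draw one Poisson(lam) variate (Knuth's multiplication method)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


rng = random.Random(7)
lam0 = 2.0                  # assumed true Poisson mean, so beta0 = ln(2)
beta0 = math.log(lam0)
N, R = 200, 2000            # sample size and number of replications (arbitrary)

z = []                      # simulated draws of sqrt(N) * (beta_hat - beta0)
for _ in range(R):
    ybar = sum(draw_poisson(lam0, rng) for _ in range(N)) / N
    beta_hat = math.log(ybar)           # solves (5.3) when x_i = 1
    z.append(math.sqrt(N) * (beta_hat - beta0))

z_mean = sum(z) / R
z_var = sum((zi - z_mean) ** 2 for zi in z) / (R - 1)
limit_var = 1.0 / lam0      # A0^{-1} B0 A0^{-1} with A0 = -lam0 and B0 = lam0
print(z_mean, z_var, limit_var)
```

In this setup the simulated mean of the rescaled estimator should be near zero and its simulated variance near $1/\lambda_0 = 0.5$, up to simulation noise and a small finite-sample bias.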


Asymptotic Normality 


To obtain the distribution of $\hat{\theta}$ from the limit distribution result (5.7), divide the
left-hand side of (5.7) by $\sqrt{N}$ and hence divide the variance by N. Then


$\hat{\theta} \stackrel{a}{\sim} \mathcal{N}[\theta_0, V[\hat{\theta}]],$ (5.8)

where $\stackrel{a}{\sim}$ means “is asymptotically distributed as,” and $V[\hat{\theta}]$ denotes the asymptotic
variance of $\hat{\theta}$ with

$V[\hat{\theta}] = N^{-1} A_0^{-1} B_0 A_0^{-1}.$ (5.9)


A complete discussion of the term asymptotic distribution has already been given in
Section 4.4.4, and is also given in Section A.6.4.

The result (5.9) depends on the unknown true parameter $\theta_0$. It is implemented by
computing the estimated asymptotic variance


$\hat{V}[\hat{\theta}] = N^{-1} \hat{A}^{-1} \hat{B} \hat{A}^{-1},$ (5.10)

where $\hat{A}$ and $\hat{B}$ are consistent estimates of $A_0$ and $B_0$.
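Table 5.1. Asymptotic Properties of m-Estimators


Property^a                 Algebraic Formula

Objective function         $Q_N(\theta) = N^{-1}\sum_i q(y_i, x_i, \theta)$ is maximized wrt $\theta$
Examples                   ML: $q_i = \ln f(y_i|x_i, \theta)$ is the log-density
                           NLS: $q_i = -(y_i - g(x_i, \theta))^2$ is minus the squared error
First-order conditions     $\partial Q_N(\theta)/\partial\theta = N^{-1}\sum_{i=1}^{N} \partial q(y_i, x_i, \theta)/\partial\theta|_{\hat{\theta}} = 0$
Consistency                Is plim $Q_N(\theta)$ maximized at $\theta = \theta_0$?
Consistency (informal)     Does $E[\partial q(y_i, x_i, \theta)/\partial\theta|_{\theta_0}] = 0$?
Limit distribution         $\sqrt{N}(\hat{\theta} - \theta_0) \stackrel{d}{\to} \mathcal{N}[0, A_0^{-1} B_0 A_0^{-1}]$
                           $A_0 = \text{plim}\, N^{-1}\sum_{i=1}^{N} \partial^2 q_i(\theta)/\partial\theta\partial\theta'|_{\theta_0}$
                           $B_0 = \text{plim}\, N^{-1}\sum_{i=1}^{N} \partial q_i/\partial\theta \times \partial q_i/\partial\theta'|_{\theta_0}$
Asymptotic distribution    $\hat{\theta} \stackrel{a}{\sim} \mathcal{N}[\theta_0, N^{-1}\hat{A}^{-1}\hat{B}\hat{A}^{-1}]$
                           $\hat{A} = N^{-1}\sum_{i=1}^{N} \partial^2 q_i(\theta)/\partial\theta\partial\theta'|_{\hat{\theta}}$
                           $\hat{B} = N^{-1}\sum_{i=1}^{N} \partial q_i/\partial\theta \times \partial q_i/\partial\theta'|_{\hat{\theta}}$


^a The limit distribution variance and asymptotic variance estimate are robust sandwich forms that assume
independence over i. See Section 5.5.2 for other variance estimates.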


The default output for many econometrics packages instead often uses a simpler
estimate $\hat{V}[\hat{\theta}] = -N^{-1}\hat{A}^{-1}$ that is only valid in some special cases. See Section 5.5
for further discussion, including various ways to estimate $A_0$ and $B_0$ and then perform
hypothesis tests.

The two leading examples of m-estimators are the ML and the NLS estimators. 
Formal results for these estimators are given in, respectively, Propositions 5.5 and 5.6. 
Simpler representations of the asymptotic distributions of these estimators are given 
in, respectively, (5.48) and (5.77). 


Poisson ML Example 


Like other ML estimators, the Poisson ML estimator is consistent if the density is
correctly specified. However, applying (5.25) from Section 5.3.7 to (5.3) reveals that
the essential condition for consistency is actually the weaker condition that
$E[y|x] = \exp(x'\beta_0)$, that is, correct specification of the mean. Similar robustness of the ML
estimator to partial misspecification of the distribution holds for some other special
cases detailed in Section 5.7.

For the Poisson ML estimator $\partial q(\beta)/\partial\beta = (y - \exp(x'\beta))x$, leading to


$A_0 = -\text{plim}\, N^{-1}\sum_i \exp(x_i'\beta_0) x_i x_i'$

and

$B_0 = \text{plim}\, N^{-1}\sum_i V[y_i|x_i]\, x_i x_i'.$

Then $\hat{\beta} \stackrel{a}{\sim} \mathcal{N}[\beta_0, N^{-1}\hat{A}^{-1}\hat{B}\hat{A}^{-1}]$, where $\hat{A} = -N^{-1}\sum_i \exp(x_i'\hat{\beta}) x_i x_i'$ and
$\hat{B} = N^{-1}\sum_i (y_i - \exp(x_i'\hat{\beta}))^2 x_i x_i'$.
Table 5.2. Marginal Effect: Three Different Estimates


Formula                                               Description

$N^{-1}\sum_i \partial E[y_i|x_i]/\partial x_i$       Average response of all individuals

$\partial E[y|x]/\partial x|_{\bar{x}}$               Response of the average individual

$\partial E[y|x]/\partial x|_{x^*}$                   Response of a representative individual with $x = x^*$


If the data are actually Poisson distributed, then $V[y|x] = E[y|x] = \exp(x'\beta_0)$, leading
to possible simplification since $A_0 = -B_0$ so that $A_0^{-1} B_0 A_0^{-1} = -A_0^{-1}$. However,
in most applications with count data $V[y|x] > E[y|x]$, so it is best not to impose this
restriction.
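
To see why the restriction matters, the following sketch compares the robust sandwich standard error with the simpler default for overdispersed counts. It is a minimal illustration under stated assumptions: an intercept-only model ($x_i = 1$, so $\hat{\beta} = \ln \bar{y}$), simulated data drawn as a Poisson-gamma mixture with mean 2 and variance 6, and hypothetical names such as `draw_poisson`, `var_robust`, and `var_default`.

```python
import math
import random


def draw_poisson(lam, rng):
    """Draw one Poisson(lam) variate (Knuth's multiplication method)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


rng = random.Random(2024)
N = 5000
# Overdispersed counts: Poisson with mean 2*v, v ~ gamma(shape 1, scale 1),
# so that E[y] = 2 but V[y] = 2 + 4 = 6 > E[y].
y = [draw_poisson(2.0 * rng.gammavariate(1.0, 1.0), rng) for _ in range(N)]

ybar = sum(y) / N
beta_hat = math.log(ybar)                  # Poisson ML estimate when x_i = 1
mu_hat = math.exp(beta_hat)                # fitted mean, equal to ybar

A_hat = -mu_hat                                     # -N^{-1} sum_i exp(x_i'b) x_i x_i'
B_hat = sum((yi - mu_hat) ** 2 for yi in y) / N     # N^{-1} sum_i (y_i - mu_i)^2 x_i x_i'

var_robust = B_hat / (N * A_hat ** 2)      # N^{-1} A^{-1} B A^{-1}
var_default = -1.0 / (N * A_hat)           # -N^{-1} A^{-1}; needs V[y|x] = E[y|x]
se_robust, se_default = math.sqrt(var_robust), math.sqrt(var_default)
print(se_robust, se_default)
```

With this design the robust standard error should be roughly $\sqrt{6/2} \approx 1.7$ times the default one, so inference based on the default would overstate precision.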


5.2.4. Coefficient Interpretation in Nonlinear Regression 


An important goal of estimation is often prediction, rather than testing the statistical 
significance of regressors. 


Marginal Effects 


Interest often lies in measuring marginal effects, the change in the conditional mean 
of y when regressors x change by one unit. 

For the linear regression model, $E[y|x] = x'\beta$ implies $\partial E[y|x]/\partial x = \beta$ so that the
coefficient has a direct interpretation as the marginal effect. For nonlinear regression
models, this interpretation is no longer possible. For example, if $E[y|x] = \exp(x'\beta)$,
then $\partial E[y|x]/\partial x = \exp(x'\beta)\beta$ is a function of both parameters and regressors, and the
size of the marginal effect depends on $x$ in addition to $\beta$.


General Regression Function 


For a general regression function


$E[y|x] = g(x, \beta),$


the marginal effect varies with the evaluation value of $x$.

It is customary to present one of the estimates of the marginal effect given in
Table 5.2. The first estimate averages the marginal effects for all individuals. The
second estimate evaluates the marginal effect at $x = \bar{x}$. The third estimate evaluates at
specific characteristics $x = x^*$. For example, $x^*$ may represent a person who is female
with 12 years of schooling and so on. More than one representative individual might be
considered.

These three measures differ in nonlinear models, whereas in the linear model they
all equal $\beta$. Even the sign of the effect may be unrelated to the sign of the
parameter, with $\partial E[y|x]/\partial x_j$ positive for some values of $x$ and negative for other
values of $x$. Considerable care must be taken in interpreting coefficients in nonlinear
models.
Computer programs and applied studies often report the second of these measures.
This can be useful in getting a sense for the magnitude of the marginal effect, but
policy interest usually lies in the overall effect, the first measure, or the effect on a
representative individual or group, the third measure. The first measure tends to change
relatively little across different choices of functional form $g(\cdot)$, whereas the other two
measures can change considerably. One can also present the full distribution of the
marginal effects using a histogram or nonparametric density estimate.
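
The three measures in Table 5.2 can be computed directly once a model has been fitted. The sketch below is a hypothetical example: it assumes an exponential conditional mean $E[y|x] = \exp(b_0 + b_1 x)$ with made-up coefficient values, a made-up list of regressor values, and an arbitrarily chosen representative value `x_star`.

```python
import math

# Hypothetical fitted exponential-mean model E[y|x] = exp(b0 + b1 * x).
b0, b1 = 0.1, 0.5
x = [0.0, 0.5, 1.0, 1.5, 2.0, 4.0]   # made-up regressor values
x_star = 1.0                          # a chosen representative value


def marginal_effect(xv):
    """dE[y|x]/dx = exp(b0 + b1 * x) * b1, evaluated at x = xv."""
    return math.exp(b0 + b1 * xv) * b1


xbar = sum(x) / len(x)
ame = sum(marginal_effect(xi) for xi in x) / len(x)   # average response of all individuals
mem = marginal_effect(xbar)                            # response of the average individual
mer = marginal_effect(x_star)                          # response at x = x_star
print(ame, mem, mer)
```

Because the exponential function is convex, the average of the individual effects exceeds the effect at the average in this example, which is one concrete way the three measures can differ.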


Single-Index Models 


Direct interpretation of regression coefficients is possible for single-index models that
specify


$E[y|x] = g(x'\beta),$ (5.11)


so that the data and parameters enter the nonlinear mean function $g(\cdot)$ by way of the
single index $x'\beta$. Then nonlinearity is of the mild form that the mean is a nonlinear
function of a linear combination of the regressors and parameters. For single-index
models the effect on the conditional mean of a change in the jth regressor using
calculus methods is


$\frac{\partial E[y|x]}{\partial x_j} = g'(x'\beta)\beta_j,$


where $g'(z) = \partial g(z)/\partial z$. It follows that the relative effects of changes in regressors
are given by the ratio of the coefficients since


$\frac{\partial E[y|x]/\partial x_j}{\partial E[y|x]/\partial x_k} = \frac{\beta_j}{\beta_k},$

because the common factor $g'(x'\beta)$ cancels. Thus if $\beta_j$ is two times $\beta_k$ then a
one-unit change in $x_j$ has twice the effect as a one-unit change in $x_k$. If $g(\cdot)$ is additionally
monotonic then it follows that the signs of the coefficients give the signs of the effects,
for all possible $x$.

Single-index models are advantageous owing to their simple interpretation. Many
standard nonlinear models such as logit, probit, and Tobit are of single-index form.
Moreover, some choices of function $g(\cdot)$ permit additional interpretation, notably the
exponential function considered later in this section and the logistic cdf analyzed in
Section 14.3.4.
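
The ratio property of single-index models is easy to verify numerically. The following sketch assumes a logistic function for $g(\cdot)$ and hypothetical coefficient values, and approximates each marginal effect by a central difference; the function names `cond_mean` and `numerical_effect` are illustrative only.

```python
import math


def logistic(z):
    """Logistic cdf, used here as the assumed function g."""
    return 1.0 / (1.0 + math.exp(-z))


def cond_mean(x, beta):
    """Single-index mean E[y|x] = g(x'beta)."""
    return logistic(sum(xj * bj for xj, bj in zip(x, beta)))


def numerical_effect(x, beta, j, h=1e-6):
    """Central-difference approximation to dE[y|x]/dx_j."""
    up = list(x)
    up[j] += h
    down = list(x)
    down[j] -= h
    return (cond_mean(up, beta) - cond_mean(down, beta)) / (2 * h)


beta = [0.4, -0.2]   # hypothetical coefficients
ratios = []
for x in ([0.0, 0.0], [1.0, 2.0], [-3.0, 0.5]):
    ratios.append(numerical_effect(x, beta, 0) / numerical_effect(x, beta, 1))
print(ratios, beta[0] / beta[1])
```

At each evaluation point the ratio of the two marginal effects should match the ratio of the coefficients, even though the individual effects themselves change with $x$.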


Finite-Difference Method 


We have emphasized the use of calculus methods. The finite-difference method
instead computes the marginal effect by comparing the conditional mean when $x_j$ is
increased by one unit with the value before the increase. Thus


$\frac{\Delta E[y|x]}{\Delta x_j} = g(x + e_j, \beta) - g(x, \beta),$


where $e_j$ is a vector with jth entry one and other entries zero.


For the linear model finite-difference and calculus methods lead to the same
estimated effects, since $\Delta E[y|x]/\Delta x_j = (x'\beta + \beta_j) - x'\beta = \beta_j$. For nonlinear models,
however, the two approaches give different estimates of the marginal effect, unless the
change in $x_j$ is infinitesimally small.

Often calculus methods are used for continuous regressors and finite-difference 
methods are used for integer-valued regressors, such as a (0, 1) indicator variable. 


Exponential Conditional Mean 


As an example, consider coefficient interpretation for an exponential conditional mean
function, so that $E[y|x] = \exp(x'\beta)$. Many count and duration models use the
exponential form.

A little algebra yields $\partial E[y|x]/\partial x_j = E[y|x] \times \beta_j$. So the parameters can be
interpreted as semi-elasticities, with a one-unit change in $x_j$ increasing the conditional
mean by the multiple $\beta_j$. For example, if $\beta_j = 0.2$ then a one-unit change in $x_j$
is predicted to lead to a 0.2 times proportionate increase in $E[y|x]$, or an increase
of 20%.

If instead the finite-difference method is used, the marginal effect is computed as
$\Delta E[y|x]/\Delta x_j = \exp(x'\beta + \beta_j) - \exp(x'\beta) = \exp(x'\beta)(e^{\beta_j} - 1)$. This differs from the
calculus result, unless $\beta_j$ is small so that $e^{\beta_j} \simeq 1 + \beta_j$. For example, if $\beta_j = 0.2$ the
increase is 22.14% rather than 20%.
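
The arithmetic behind this comparison can be reproduced in a few lines. The snippet is only a worked check of the example in the text; the variable names are our own.

```python
import math

beta_j = 0.2   # coefficient value from the example in the text

# Proportionate change in E[y|x] for a one-unit change in x_j:
calculus_effect = beta_j                      # (dE[y|x]/dx_j) / E[y|x]
finite_diff_effect = math.exp(beta_j) - 1.0   # (E[y|x + e_j] - E[y|x]) / E[y|x]

print(round(100 * calculus_effect, 2), round(100 * finite_diff_effect, 2))
```

The gap between the two measures grows with the size of the coefficient, since $e^{\beta_j} - 1 - \beta_j$ is approximately $\beta_j^2/2$ for small $\beta_j$.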


5.3. Extremum Estimators 


This section is intended for use in an advanced graduate course in microeconomet- 
rics. It presents the key results on consistency and asymptotic normality of extremum 
estimators, a very general class of estimators that minimize or maximize an objective 
function. The presentation is very condensed. A more complete understanding requires 
an advanced treatment such as that in Amemiya (1985), the basis of the treatment here, 
or in Newey and McFadden (1994). 


5.3.1. Extremum Estimators 


For cross-section analysis of a single dependent variable the sample is one of N
observations, $\{(y_i, x_i), i = 1, \ldots, N\}$, on a dependent variable $y_i$, and a column vector
$x_i$ of regressors. In matrix notation the sample is $(y, X)$, where $y$ is an $N \times 1$ vector
with ith entry $y_i$ and $X$ is a matrix with ith row $x_i'$, as defined more completely in
Section 1.6.

Interest lies in estimating the $q \times 1$ parameter vector $\theta = [\theta_1 \ldots \theta_q]'$. The value
$\theta_0$, termed the true parameter value, is the particular value of $\theta$ in the process that
generated the data, called the data-generating process.

We consider estimators $\hat{\theta}$ that maximize over $\theta \in \Theta$ the stochastic objective
function $Q_N(\theta) = Q_N(y, X, \theta)$, where for notational simplicity the dependence of $Q_N(\theta)$
on the data is indicated only via the subscript N. Such estimators are called extremum
estimators, since they solve a maximization or minimization problem.
The extremum estimator may be a global maximum, so


$\hat{\theta} = \arg\max_{\theta \in \Theta} Q_N(\theta).$ (5.12)


Usually the extremum estimator is a local maximum, computed as the solution to the
associated first-order conditions


$\frac{\partial Q_N(\theta)}{\partial\theta}\Big|_{\hat{\theta}} = 0,$ (5.13)

where $\partial Q_N(\theta)/\partial\theta$ is a $q \times 1$ column vector with kth entry $\partial Q_N(\theta)/\partial\theta_k$. The
local maximum is emphasized because it is the local maximum that may be
asymptotically normally distributed. The local and global maxima coincide if $Q_N(\theta)$ is globally
concave.

There are two leading examples of extremum estimators. For m-estimators
considered in this chapter, notably ML and NLS estimators, $Q_N(\theta)$ is a sample average such
as average of squared residuals. For the generalized method of moments estimator (see
Section 6.3) $Q_N(\theta)$ is a quadratic form in sample averages.

For concreteness the discussion focuses on single-equation cross-section regression.
But the results are quite general and apply to any estimator based on optimization that
satisfies properties given in this section. In particular there is no restriction to a scalar
dependent variable and several authors use the notation $z_i$ in place of $(y_i, x_i)$. Then
$Q_N(\theta)$ equals $Q_N(Z, \theta)$ rather than $Q_N(y, X, \theta)$.


5.3.2. Formal Consistency Theorems 


We first consider parameter identification, introduced in Section 2.5. Intuitively the
parameter $\theta_0$ is identified if the distribution of the data, or feature of the distribution of
interest, is determined by $\theta_0$ whereas any other value of $\theta$ leads to a different
distribution. For example, in linear regression we required $E[y|X] = X\beta$ and $X\beta^{(1)} = X\beta^{(2)}$
if and only if $\beta^{(1)} = \beta^{(2)}$.

An estimation procedure may not identify $\theta_0$. For example, this is the case if the
estimation procedure omits some relevant regressors. We say that an estimation method
identifies $\theta_0$ if the probability limit of the objective function, taken with respect to
the dgp with parameter $\theta = \theta_0$, is maximized uniquely at $\theta = \theta_0$. This identification
condition is an asymptotic one. Practical estimation problems that can arise in a finite
sample are discussed in Chapter 10.

Consistency is established in the following manner. As $N \to \infty$ the stochastic
objective function $Q_N(\theta)$, an average in the case of m-estimation, may converge in
probability to a limit function, denoted $Q_0(\theta)$, that in the simplest case is nonstochastic.
The corresponding maxima (global or local) of $Q_N(\theta)$ and $Q_0(\theta)$ should then
occur for values of $\theta$ close to each other. Since the maximum of $Q_N(\theta)$ is attained at $\hat{\theta}$ by
definition, it follows that $\hat{\theta}$ converges in probability to $\theta_0$ provided $\theta_0$ maximizes
$Q_0(\theta)$.
Clearly, consistency and identification are closely related, and Amemiya (1985,
p. 230) states that a simple approach is to view identification to mean existence of a
consistent estimator. For further discussion see Newey and McFadden (1994, p. 2124)
and Deistler and Seifert (1978).

Key applications of this approach include Jennrich (1969) and Amemiya (1973).
Amemiya (1985) and Newey and McFadden (1994) present quite general theorems.
These theorems require several assumptions, including smoothness (continuity) and
existence of necessary derivatives of the objective function, assumptions on the dgp
to ensure convergence of $Q_N(\theta)$ to $Q_0(\theta)$, and maximization of $Q_0(\theta)$ at $\theta = \theta_0$.
Different consistency theorems use slightly different assumptions.

We present two consistency theorems due to Amemiya (1985), one for a global 
maximum and one for a local maximum. The notation in Amemiya’s theorems has 
been modified as Amemiya (1985) defines the objective function without the normal- 
ization 1/N present in, for example, (5.4). 


Theorem 5.1 (Consistency of Global Maximum) (Amemiya, 1985,
Theorem 4.1.1): Make the following assumptions:

(i) The parameter space $\Theta$ is a compact subset of $R^q$.

(ii) The objective function $Q_N(\theta)$ is a measurable function of the data for all
$\theta \in \Theta$, and $Q_N(\theta)$ is continuous in $\theta \in \Theta$.

(iii) $Q_N(\theta)$ converges uniformly in probability to a nonstochastic function $Q_0(\theta)$,
and $Q_0(\theta)$ attains a unique global maximum at $\theta_0$.

Then the estimator $\hat{\theta} = \arg\max_{\theta \in \Theta} Q_N(\theta)$ is consistent for $\theta_0$, that is, $\hat{\theta} \stackrel{p}{\to} \theta_0$.


Uniform convergence in probability of $Q_N(\theta)$ to

$Q_0(\theta) = \text{plim}\, Q_N(\theta)$ (5.14)

in condition (iii) means that $\sup_{\theta \in \Theta} |Q_N(\theta) - Q_0(\theta)| \stackrel{p}{\to} 0$.
For a local maximum, first derivatives need to exist, but one need then only consider
the behavior of $Q_N(\theta)$ and its derivative in the neighborhood of $\theta_0$.


Theorem 5.2 (Consistency of Local Maximum) (Amemiya, 1985,
Theorem 4.1.2): Make the following assumptions:

(i) The parameter space $\Theta$ is an open subset of $R^q$.

(ii) $Q_N(\theta)$ is a measurable function of the data for all $\theta \in \Theta$, and $\partial Q_N(\theta)/\partial\theta$
exists and is continuous in an open neighborhood of $\theta_0$.

(iii) The objective function $Q_N(\theta)$ converges uniformly in probability to $Q_0(\theta)$ in
an open neighborhood of $\theta_0$, and $Q_0(\theta)$ attains a unique local maximum at $\theta_0$.

Then one of the solutions to $\partial Q_N(\theta)/\partial\theta = 0$ is consistent for $\theta_0$.

An example of use of Theorem 5.2 is given later in Section 5.3.4.
Condition (i) in Theorem 5.1 permits a global maximum to be at the boundary of the
parameter space, whereas in Theorem 5.2 a local maximum has to be in the interior of
the parameter space. Condition (ii) in Theorem 5.2 also implies continuity of $Q_N(\theta)$
in the open neighborhood of $\theta_0$, where a neighborhood $N(\theta_0)$ of $\theta_0$ is open if and
only if there exists a ball with center $\theta_0$ entirely contained in $N(\theta_0)$. In both theorems
condition (iii) is the essential condition. The maximum, global or local, of $Q_0(\theta)$ must
occur at $\theta = \theta_0$. The second part of (iii) provides the identification condition that $\theta_0$
has a meaningful interpretation and is unique.

For a local maximum, analysis is straightforward if there is only one local
maximum. Then $\hat{\theta}$ is uniquely defined by $\partial Q_N(\theta)/\partial\theta|_{\hat{\theta}} = 0$. When there is more than one
local maximum, the theorem simply says that one of the local maxima is consistent,
but no guidance is given as to which one is consistent. It is best in such cases to
consider the global maximum and apply Theorem 5.1. See Newey and McFadden (1994,
p. 2117) for a discussion.

An important distinction is made between model specification, reflected in the
choice of objective function $Q_N(\theta)$, and the actual dgp of $(y, X)$ used in obtaining
$Q_0(\theta)$ in (5.14). For some dgps an estimator may be consistent, whereas for other dgps
an estimator may be inconsistent. In some cases, such as the Poisson ML and OLS
estimators, consistency arises under a wide range of dgps provided the conditional mean
is correctly specified. In other cases consistency requires stronger assumptions on the
dgp such as correct specification of the density.


5.3.3. Asymptotic Normality 


Results on asymptotic normality are usually restricted to the local maximum of $Q_N(\theta)$.
Then $\hat{\theta}$ solves (5.13), which in general is nonlinear in $\hat{\theta}$ and has no explicit solution
for $\hat{\theta}$. Instead, we replace the left-hand side of this equation by a linear function of $\hat{\theta}$,
by use of a Taylor series expansion, and then solve for $\hat{\theta}$.

The most often used version of Taylor’s theorem is an approximation with a
remainder term. Here we instead consider an exact first-order Taylor expansion. For
the differentiable function $f(\cdot)$ there always exists a point $x^+$ between $x$ and $x_0$ such
that


$f(x) = f(x_0) + f'(x^+)(x - x_0),$


where $f'(x) = \partial f(x)/\partial x$ is the derivative of $f(x)$. This result is also known as the
mean value theorem.

Application to the current setting requires several changes. The scalar function $f(\cdot)$
is replaced by a vector function $\mathbf{f}(\cdot)$ and the scalar arguments $x$, $x_0$, and $x^+$ are replaced
by the vectors $\hat{\theta}$, $\theta_0$, and $\theta^+$. Then


$\mathbf{f}(\hat{\theta}) = \mathbf{f}(\theta_0) + \frac{\partial \mathbf{f}(\theta)}{\partial\theta'}\Big|_{\theta^+} (\hat{\theta} - \theta_0),$ (5.15)


where $\partial \mathbf{f}(\theta)/\partial\theta'$ is a matrix, for some unknown $\theta^+$ between $\hat{\theta}$ and $\theta_0$, and formally
$\theta^+$ differs for each row of this matrix (see Newey and McFadden, 1994, p. 2141).
For the local extremum estimator the function $\mathbf{f}(\theta) = \partial Q_N(\theta)/\partial\theta$ is already a first
derivative. Then an exact first-order Taylor series expansion around $\theta_0$ yields


$\frac{\partial Q_N(\theta)}{\partial\theta}\Big|_{\hat{\theta}} = \frac{\partial Q_N(\theta)}{\partial\theta}\Big|_{\theta_0} + \frac{\partial^2 Q_N(\theta)}{\partial\theta\,\partial\theta'}\Big|_{\theta^+} (\hat{\theta} - \theta_0),$ (5.16)


where $\partial^2 Q_N(\theta)/\partial\theta\partial\theta'$ is a $q \times q$ matrix with (j, k)th entry $\partial^2 Q_N(\theta)/\partial\theta_j\partial\theta_k$, and
$\theta^+$ is a point between $\hat{\theta}$ and $\theta_0$.

The first-order conditions set the left-hand side of (5.16) to zero. Setting the
right-hand side to 0 and solving for $(\hat{\theta} - \theta_0)$ yields


$\sqrt{N}(\hat{\theta} - \theta_0) = -\left(\frac{\partial^2 Q_N(\theta)}{\partial\theta\,\partial\theta'}\Big|_{\theta^+}\right)^{-1} \sqrt{N}\, \frac{\partial Q_N(\theta)}{\partial\theta}\Big|_{\theta_0},$ (5.17)


where we rescale by $\sqrt{N}$ to ensure a nondegenerate limit distribution (discussed
further in the following).

Result (5.17) provides a solution for $\hat{\theta}$. It is of no use for numerical computation
of $\hat{\theta}$, since it depends on $\theta_0$ and $\theta^+$, both of which are unknown, but it is fine for
theoretical analysis. In particular, if it has been established that $\hat{\theta}$ is consistent for $\theta_0$
then the unknown $\theta^+$ converges in probability to $\theta_0$, because it lies between $\hat{\theta}$ and $\theta_0$
and by consistency $\hat{\theta}$ converges in probability to $\theta_0$.

The result (5.17) expresses $\sqrt{N}(\hat{\theta} - \theta_0)$ in a form similar to that used to obtain the
limit distribution of the OLS estimator (see Section 5.2.3). All we need do is assume
a probability limit for the first term on the right-hand side of (5.17) and a limit normal
distribution for the second term.

This leads to the following theorem, from Amemiya (1985), for an extremum estimator
satisfying a local maximum. Again note that Amemiya (1985) defines the objective
function without the normalization $1/N$. Also, Amemiya defines $A_0$ and $B_0$ in
terms of $\lim \mathrm{E}$ rather than $\operatorname{plim}$.


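Theorem 5.3 (Limit Distribution of Local Maximum) (Amemiya, 1985, Theorem
4.1.3): In addition to the assumptions of the preceding theorem for consistency
of the local maximum make the following assumptions:

(i) $\partial^2 Q_N(\theta)/\partial \theta \partial \theta'$ exists and is continuous in an open convex neighborhood of
$\theta_0$.

(ii) $\partial^2 Q_N(\theta)/\partial \theta \partial \theta'|_{\theta^+}$ converges in probability to the finite nonsingular matrix

$$A_0 = \operatorname{plim} \left.\frac{\partial^2 Q_N(\theta)}{\partial \theta \partial \theta'}\right|_{\theta_0} \tag{5.18}$$

for any sequence $\theta^+$ such that $\theta^+ \xrightarrow{p} \theta_0$.

(iii) $\sqrt{N}\, \partial Q_N(\theta)/\partial \theta|_{\theta_0} \xrightarrow{d} \mathcal{N}[0, B_0]$, where

$$B_0 = \operatorname{plim} \left[ N \left.\frac{\partial Q_N(\theta)}{\partial \theta} \times \frac{\partial Q_N(\theta)}{\partial \theta'}\right|_{\theta_0} \right]. \tag{5.19}$$

Then the limit distribution of the extremum estimator is

$$\sqrt{N}(\hat{\theta} - \theta_0) \xrightarrow{d} \mathcal{N}[0, A_0^{-1} B_0 A_0^{-1}], \tag{5.20}$$

where the estimator $\hat{\theta}$ is the consistent solution to $\partial Q_N(\theta)/\partial \theta = 0$.
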
The proof follows directly from the Limit Normal Product Rule (Theorem A.17)
applied to (5.17). Note that the proof assumes that consistency of $\hat{\theta}$ has already been
established. The expressions for $A_0$ and $B_0$ given in Table 5.1 are specializations to the
case $Q_N(\theta) = N^{-1} \sum_i q_i(\theta)$ with independence over $i$.

The probability limits in (5.18) and (5.19) are obtained with respect to the dgp 
for (y, X). In some applications the regressors are assumed to be nonstochastic and 
the expectation is with respect to y only. In other cases the regressors are treated as 
stochastic and the expectations are then with respect to both y and X. 


5.3.4. Poisson ML Estimator Asymptotic Properties Example 


We formally prove consistency and asymptotic normality of the Poisson ML estimator,
under exogenous stratified sampling with stochastic regressors so that $(y_i, x_i)$ are inid,
without necessarily assuming that $y_i$ is Poisson distributed.

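The key step to prove consistency is to obtain $Q_0(\beta) = \operatorname{plim} Q_N(\beta)$ and verify that
$Q_0(\beta)$ attains a maximum at $\beta = \beta_0$. For $Q_N(\beta)$ defined in (5.1), we have

$$\begin{aligned}
Q_0(\beta) &= \operatorname{plim} N^{-1} \sum_i \left\{ -e^{x_i'\beta} + y_i x_i'\beta - \ln y_i! \right\} \\
&= \lim N^{-1} \sum_i \left\{ -\mathrm{E}[e^{x_i'\beta}] + \mathrm{E}[y_i x_i'\beta] - \mathrm{E}[\ln y_i!] \right\} \\
&= \lim N^{-1} \sum_i \left\{ -\mathrm{E}[e^{x_i'\beta}] + \mathrm{E}[e^{x_i'\beta_0} x_i'\beta] - \mathrm{E}[\ln y_i!] \right\}.
\end{aligned}$$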

The second equality assumes a law of large numbers can be applied to each term. Since
$(y_i, x_i)$ are inid, the Markov LLN (Theorem A.9) can be applied if each of the expected
values given in the second line exists and additionally the corresponding $(1 + \delta)$th
absolute moment exists for some $\delta > 0$ and the side condition given in Theorem A.9
is satisfied. For example, set $\delta = 1$ so that second moments are used. The third line
requires the assumption that the dgp is such that $\mathrm{E}[y|x] = \exp(x'\beta_0)$. The first two
expectations in the third line are with respect to $x$, which is stochastic. Note that $Q_0(\beta)$
depends on both $\beta$ and $\beta_0$. Differentiating with respect to $\beta$, and assuming that limits,
derivatives, and expectations can be interchanged, we get


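$$\frac{\partial Q_0(\beta)}{\partial \beta} = -\lim N^{-1} \sum_i \mathrm{E}[e^{x_i'\beta} x_i] + \lim N^{-1} \sum_i \mathrm{E}[e^{x_i'\beta_0} x_i],$$

where the derivative of $\mathrm{E}[\ln y!]$ with respect to $\beta$ is zero since $\mathrm{E}[\ln y!]$ will depend
on $\beta_0$, the true parameter value in the dgp, but not on $\beta$. Clearly, $\partial Q_0(\beta)/\partial \beta = 0$ at
$\beta = \beta_0$ and $\partial^2 Q_0(\beta)/\partial \beta \partial \beta' = -\lim N^{-1} \sum_i \mathrm{E}[\exp(x_i'\beta) x_i x_i']$ is negative definite, so
$Q_0(\beta)$ attains a local maximum at $\beta = \beta_0$ and the Poisson ML estimator is consistent
by Theorem 5.2. Since here $Q_N(\beta)$ is globally concave the local maximum equals the
global maximum and consistency can also be established using Theorem 5.1.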

For asymptotic normality of the Poisson ML estimator, the exact first-order Taylor
series expansion of the Poisson ML estimator first-order conditions (5.3) yields

$$\sqrt{N}(\hat{\beta} - \beta_0) = -\left[ -N^{-1} \sum_i e^{x_i'\beta^+} x_i x_i' \right]^{-1} N^{-1/2} \sum_i (y_i - e^{x_i'\beta_0}) x_i, \tag{5.21}$$

for some unknown $\beta^+$ between $\hat{\beta}$ and $\beta_0$. Making sufficient assumptions on regressors
$x$ so that the Markov LLN can be applied to the first term, and using $\beta^+ \xrightarrow{p} \beta_0$ since
$\hat{\beta} \xrightarrow{p} \beta_0$, we have

$$-N^{-1} \sum_i e^{x_i'\beta^+} x_i x_i' \xrightarrow{p} A_0 = -\lim N^{-1} \sum_i \mathrm{E}[e^{x_i'\beta_0} x_i x_i']. \tag{5.22}$$


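For the second term in (5.21) begin by assuming scalar regressor $x$. Then $X = (y -
\exp(x\beta_0))x$ has mean $\mathrm{E}[X] = 0$, as $\mathrm{E}[y|x] = \exp(x\beta_0)$ has already been assumed for
consistency, and variance $\mathrm{V}[X] = \mathrm{E}[\mathrm{V}[y|x]x^2]$. The Liapounov CLT (Theorem A.15)
can be applied if the side condition involving a $(2 + \delta)$th absolute moment of $(y -
\exp(x\beta_0))x$ is satisfied. For this example with $y \geq 0$ it is sufficient to assume that the
third moment of $y$ exists, that is, $\delta = 1$, and $x$ is bounded. Applying the CLT gives

$$Z_N = \frac{\sum_i (y_i - e^{\beta_0 x_i}) x_i}{\sqrt{\sum_i \mathrm{E}[\mathrm{V}[y_i|x_i] x_i^2]}} \xrightarrow{d} \mathcal{N}[0, 1],$$

so

$$N^{-1/2} \sum_i (y_i - e^{\beta_0 x_i}) x_i \xrightarrow{d} \mathcal{N}\left[0, \lim N^{-1} \sum_i \mathrm{E}[\mathrm{V}[y_i|x_i] x_i^2]\right],$$

assuming the limit in the expression for the asymptotic variance exists. This result can
be extended to the vector regressor case using the Cramer-Wold device (see Theorem
A.16). Then

$$N^{-1/2} \sum_i (y_i - e^{x_i'\beta_0}) x_i \xrightarrow{d} \mathcal{N}\left[0, B_0 = \lim N^{-1} \sum_i \mathrm{E}[\mathrm{V}[y_i|x_i] x_i x_i']\right]. \tag{5.23}$$

Thus (5.21) yields $\sqrt{N}(\hat{\beta} - \beta_0) \xrightarrow{d} \mathcal{N}[0, A_0^{-1} B_0 A_0^{-1}]$, where $A_0$ is defined in (5.22)
and $B_0$ is defined in (5.23).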

Note that for this particular example $y|x$ need not be Poisson distributed for the
Poisson ML estimator to be consistent and asymptotically normal. The essential
assumption for consistency of the Poisson ML estimator is that the dgp is such that
$\mathrm{E}[y|x] = \exp(x'\beta_0)$.

For asymptotic normality the essential assumption is that $\mathrm{V}[y|x]$ exists, though
additional assumptions on existence of higher moments are needed to permit use
of LLN and CLT. If in fact $\mathrm{V}[y|x] = \exp(x'\beta_0)$ then $A_0 = -B_0$ and more simply
$\sqrt{N}(\hat{\beta} - \beta_0) \xrightarrow{d} \mathcal{N}[0, -A_0^{-1}]$. The results for this ML example extend to the LEF
class of densities defined in Section 5.7.3.
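
The consistency result can be illustrated numerically. The following is a minimal sketch, not part of the formal argument: it assumes a scalar regressor without intercept, a hypothetical dgp in which y equals exp(x β0) times a mean-one exponential error (so E[y|x] = exp(x β0) holds but y is not Poisson), and illustrative function names `simulate` and `poisson_qmle`. The Poisson first-order condition is solved by Newton-Raphson.

```python
import math
import random


def simulate(n, beta0, seed=0):
    """Draw (y, x) pairs with E[y|x] = exp(x * beta0), where y is NOT Poisson."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)                      # bounded scalar regressor
        y = math.exp(x * beta0) * rng.expovariate(1.0)  # multiplicative error with mean 1
        data.append((y, x))
    return data


def poisson_qmle(data, beta=0.0, tol=1e-10, max_iter=100):
    """Solve N^-1 sum_i (y_i - exp(x_i * beta)) x_i = 0 by Newton-Raphson."""
    n = len(data)
    for _ in range(max_iter):
        score = sum((y - math.exp(x * beta)) * x for y, x in data) / n
        hessian = -sum(math.exp(x * beta) * x * x for _, x in data) / n
        step = -score / hessian
        beta += step
        if abs(step) < tol:
            break
    return beta


if __name__ == "__main__":
    beta0 = 0.5  # assumed true value, chosen only for illustration
    for n in (200, 2000, 20000):
        print(n, round(poisson_qmle(simulate(n, beta0)), 3))
```

Under these assumptions the printed estimates should settle near the assumed β0 as N grows, even though the Poisson density is misspecified.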


5.3.5. Proofs of Consistency and Asymptotic Normality 


The assumptions made in Theorems 5.1—5.3 are quite general and need not hold in 
every application. These assumptions need to be verified on a case-by-case basis, in a 
manner similar to the preceding Poisson ML estimator example. Here we sketch out 
details for m-estimators. 

For consistency, the key step is to obtain the probability limit of $Q_N(\theta)$. This is
done by application of an LLN because for an m-estimator $Q_N(\theta)$ is the average
$N^{-1} \sum_i q_i(\theta)$. Different assumptions on the dgp lead to the use of different LLNs
and more substantively to different expressions for $Q_0(\theta)$.

Asymptotic normality requires assumptions on the dgp in addition to those required
for consistency. Specifically, we need assumptions on the dgp to enable application of
an LLN to obtain $A_0$ and to enable application of a CLT to obtain $B_0$.

For an m-estimator an LLN is likely to verify condition (ii) of Theorem 5.3 as each
entry in the matrix $\partial^2 Q_N(\theta)/\partial \theta \partial \theta'$ is an average since $Q_N(\theta)$ is an average. A CLT
is likely to yield condition (iii) of Theorem 5.3, since $\sqrt{N}\, \partial Q_N(\theta)/\partial \theta|_{\theta_0}$ has mean
0 from the informal consistency condition (5.24) in Section 5.3.7 and finite variance
$\mathrm{E}[N\, \partial Q_N(\theta)/\partial \theta \times \partial Q_N(\theta)/\partial \theta'|_{\theta_0}]$.

The particular CLT and LLN used to obtain the limit distribution of the estimator 
vary with assumptions about the dgp for (y, X). In all cases the dependent variable is 
stochastic. However, the regressors may be fixed or stochastic, and in the latter case 
they may exhibit time-series dependence. These issues have already been considered 
for OLS in Section 4.4.7. 

The common microeconometrics assumption is that regressors are stochastic with
independence across observations, which is reasonable for cross-section data from
national surveys. For simple random sampling, the data $(y_i, x_i)$ are iid and Kolmogorov
LLN and Lindeberg-Levy CLT (Theorems A.8 and A.14) can be used. Furthermore,
under simple random sampling (5.18) and (5.19) then simplify to

$$A_0 = \mathrm{E}\left[ \left.\frac{\partial^2 q(y, x, \theta)}{\partial \theta \partial \theta'}\right|_{\theta_0} \right]$$

and

$$B_0 = \mathrm{E}\left[ \left.\frac{\partial q(y, x, \theta)}{\partial \theta} \frac{\partial q(y, x, \theta)}{\partial \theta'}\right|_{\theta_0} \right],$$

where $(y, x)$ denotes a single observation and expectations are with respect to the joint
distribution of $(y, x)$. This simpler notation is used in several texts.

For stratified random sampling and for fixed regressors the data $(y_i, x_i)$ are inid and
Markov LLN and Liapounov CLT (Theorems A.9 and A.15) need to be used. These 
require moment assumptions additional to those made in the iid case. In the stochastic 
regressors case, expectations are with respect to the joint distribution of (y, x), whereas 
in the fixed regressors case, such as in a controlled experiment where the level of x can 
be set, the expectations in (5.18) and (5.19) are with respect to y only. 

For time-series data the regressors are assumed to be stochastic, but they are also 
assumed to be dependent across observations, a necessary framework to accommo- 
date lagged dependent variables. Hamilton (1994) focuses on this case, which is also 
studied extensively by White (2001a). The simplest treatments restrict the random vari- 
ables (y, x) to have stationary distribution. If instead the data are nonstationary with 
unit roots then rates of convergence may no longer be $\sqrt{N}$ and the limit distributions
may be nonnormal. 

Despite these important conceptual and theoretical differences about the stochastic 
nature of (y, x), however, for cross-section regression the eventual limit theorem is 
usually of the general form given in Theorem 5.3. 


5.3.6. Discussion


The form of the variance matrix in (5.20) is called the sandwich form, with $B_0$
sandwiched between $A_0^{-1}$ and $A_0^{-1}$. The sandwich form, introduced in Section 4.4.4, will
be discussed in more detail in Section 5.5.2.

The asymptotic results can be extended to inconsistent estimators. Then $\theta_0$ is
replaced by the pseudo-true value $\theta^*$, defined to be that value of $\theta$ that yields the local
maximum of $Q_0(\theta)$. This is considered in further detail for quasi-ML estimation in
Section 5.7.1. In most cases, however, the estimator is consistent and in later chapters
the subscript 0 is often dropped to simplify notation.

In the preceding results the objective function $Q_N(\theta)$ is initially defined with
normalization by $1/N$, the first derivative of $Q_N(\theta)$ is then normalized by $\sqrt{N}$, and the
second derivative is not normalized, leading to a $\sqrt{N}$-consistent estimator. In some
cases alternative normalizations may be needed, most notably time series with
nonstationary trend.

The results assume that $Q_N(\theta)$ is a continuous differentiable function. This
excludes some estimators such as least absolute deviations, for which $Q_N(\theta) =
N^{-1} \sum_i |y_i - x_i'\beta|$. One way to proceed in this case is to obtain a differentiable
approximating function $Q_N^*(\theta)$ such that $Q_N^*(\theta) - Q_N(\theta) \xrightarrow{p} 0$ and apply the preceding
theorem to $Q_N^*(\theta)$.

The key component to obtaining the limit distribution is linearization using a Taylor 
series expansion. Taylor series expansions can be a poor global approximation to a 
function. They work well in the statistical application here as the approximation is 
asymptotically a local one, since consistency implies that for large sample sizes $\hat{\theta}$ is
close to the point of expansion $\theta_0$. More refined asymptotic theory is possible using the
Edgeworth expansion (see Section 11.4.3). The bootstrap (see Chapter 11) is a method 
to empirically implement an Edgeworth expansion. 


5.3.7. Informal Approach to Consistency of an m-Estimator 


For the practitioner the limit normal result of Theorem 5.3 is much easier to prove than 
formal proof of consistency using Theorem 5.1 or 5.2. Here we present an informal 
approach to determining the nature and strength of distributional assumptions needed 
for an m-estimator to be consistent. 

For an m-estimator that is a local maximum, the first-order conditions (5.4) imply
that $\hat{\theta}$ is chosen so that the average of $\partial q_i(\theta)/\partial \theta|_{\hat{\theta}}$ equals zero. Intuitively, a necessary
condition for this to yield a consistent estimator for $\theta_0$ is that in the limit the average
of $\partial q_i(\theta)/\partial \theta|_{\theta_0}$ goes to 0, or that

$$\operatorname{plim} \left.\frac{\partial Q_N(\theta)}{\partial \theta}\right|_{\theta_0} = \lim \frac{1}{N} \sum_{i=1}^{N} \mathrm{E}\left[ \left.\frac{\partial q_i(\theta)}{\partial \theta}\right|_{\theta_0} \right] = 0, \tag{5.24}$$

where the first equality requires the assumption that a law of large numbers can be
applied and expectation in (5.24) is taken with respect to the population dgp for $(y, X)$.
The limit is used as the equality need not be exact, provided any departure from zero
disappears as $N \to \infty$. For example, consistency should hold if the expectation equals
$1/N$. The condition (5.24) provides a very useful check for the practitioner. An informal
approach to consistency is to look at the first-order conditions for the estimator
$\hat{\theta}$ and determine whether in the limit these have expectation zero when evaluated at
$\theta = \theta_0$.

Even less formally, if we consider the components in the sum, the essential condi- 
tion for consistency is whether for the typical observation 


$$\mathrm{E}\left[ \partial q(\theta)/\partial \theta|_{\theta_0} \right] = 0. \tag{5.25}$$


This condition can provide a very useful guide to the practitioner. However, it is neither
a necessary nor a sufficient condition. If the expectation in (5.25) equals $1/N$ then it
is still likely that the probability limit in (5.24) equals zero, so the condition (5.25) is
not necessary. To see that it is not sufficient, consider $y$ iid with mean $\mu_0$ estimated
using just one observation, say the first observation $y_1$. Then $\hat{\mu}$ solves $y_1 - \mu = 0$ and
(5.25) is satisfied. But clearly $y_1$ does not converge in probability to $\mu_0$, as the single
observation $y_1$ has a variance that does not go to zero. The problem is that here the
plim in (5.24) does not equal $\lim \mathrm{E}$.
Formal proof of consistency requires use of theorems such as Theorem 5.1 or 5.2.

For Poisson regression use of (5.25) reveals that the essential condition for consistency
is correct specification of the conditional mean of $y|x$ (see Section 5.2.3). Similarly,
the OLS estimator solves $N^{-1} \sum_i x_i(y_i - x_i'\beta) = 0$, so from (5.25) consistency
essentially requires that $\mathrm{E}[x(y - x'\beta_0)] = 0$. This condition fails if $\mathrm{E}[y|x] \neq x'\beta_0$,
which can happen for many reasons, as given in Section 4.7. In other examples use
of (5.25) can indicate that consistency will require considerably more parametric
assumptions than correct specification of the conditional mean.
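
As a worked step for the Poisson case, the check based on (5.25) can be sketched using the law of iterated expectations. The per-observation term $(y - \exp(x'\beta))x$ is taken from the Poisson first-order conditions (5.3); the notation $\mathrm{E}_x$ for the expectation over the regressors is introduced here only for this illustration.

```latex
\begin{aligned}
\mathrm{E}\big[(y - \exp(x'\beta_0))\,x\big]
  &= \mathrm{E}_x\big[\,\mathrm{E}[\,y - \exp(x'\beta_0) \mid x\,]\,x\,\big] \\
  &= \mathrm{E}_x\big[\,(\mathrm{E}[y \mid x] - \exp(x'\beta_0))\,x\,\big] \\
  &= 0 \quad \text{if } \mathrm{E}[y \mid x] = \exp(x'\beta_0).
\end{aligned}
```

The same two lines with $x'\beta_0$ in place of $\exp(x'\beta_0)$ give the OLS requirement $\mathrm{E}[y|x] = x'\beta_0$ as a sufficient condition.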

To link use of (5.24) to condition (iii) in Theorem 5.2, note the following: 


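$$\begin{aligned}
& \partial Q_0(\theta)/\partial \theta = 0 && \text{(condition (iii) in Theorem 5.2)} \\
\Rightarrow\; & \partial(\operatorname{plim} Q_N(\theta))/\partial \theta = 0 && \text{(from definition of } Q_0(\theta)\text{)} \\
\Rightarrow\; & \partial(\lim \mathrm{E}[Q_N(\theta)])/\partial \theta = 0 && \text{(as an LLN} \Rightarrow Q_0 = \operatorname{plim} Q_N = \lim \mathrm{E}[Q_N]\text{)} \\
\Rightarrow\; & \lim \partial \mathrm{E}[Q_N(\theta)]/\partial \theta = 0 && \text{(interchanging limits and differentiation), and} \\
\Rightarrow\; & \lim \mathrm{E}[\partial Q_N(\theta)/\partial \theta] = 0 && \text{(interchanging differentiation and expectation).}
\end{aligned}$$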


The last line is the informal condition (5.24). However, obtaining this result re- 
quires additional assumptions, including restriction to local maximum, application 
of a law of large numbers, interchangeability of limits and differentiation, and in- 
terchangeability of differentiation and expectation (i.e., integration). In the scalar 
case a sufficient condition for interchanging differentiation and limits is $\lim_{h \to 0}
(\mathrm{E}[Q_N(\theta + h)] - \mathrm{E}[Q_N(\theta)])/h = d\mathrm{E}[Q_N(\theta)]/d\theta$ uniformly in $\theta$.


5.4. Estimating Equations 


The derivation of the limit distribution given in Section 5.3.3 can be extended from a 
local extremum estimator to estimators defined as being the solution of an estimating 
equation that sets an average to zero. Several examples are given in Chapter 6. 


5.4.1. Estimating Equations Estimator


Let $\hat{\theta}$ be defined as the solution to the system of $q$ estimating equations

$$h_N(\hat{\theta}) = \frac{1}{N} \sum_{i=1}^{N} h(y_i, x_i, \hat{\theta}) = 0, \tag{5.26}$$

where $h(\cdot)$ is a $q \times 1$ vector, and independence over $i$ is assumed. Examples of $h(\cdot)$ are
given later in Section 5.4.2.

Since $\hat{\theta}$ is chosen so that the sample average of $h(y, x, \theta)$ equals zero, we expect that
$\hat{\theta} \xrightarrow{p} \theta_0$ if in the limit the average of $h(y, x, \theta_0)$ goes to zero, that is, if $\operatorname{plim} h_N(\theta_0) =
0$. If an LLN can be applied this requires that $\lim \mathrm{E}[h_N(\theta_0)] = 0$, or more loosely that
for the $i$th observation

$$\mathrm{E}[h(y_i, x_i, \theta_0)] = 0. \tag{5.27}$$


The easiest way to formally establish consistency is actually to derive (5.26) as the 
first-order conditions for an m-estimator. 

Assuming consistency, the limit distribution of the estimating equations estimator
can be obtained in the same manner as in Section 5.3.3 for the extremum estimator.
Take an exact first-order Taylor series expansion of $h_N(\theta)$ around $\theta_0$, as in (5.15) with
$\mathbf{f}(\theta) = h_N(\theta)$, and set the right-hand side to 0 and solve. Then

$$\sqrt{N}(\hat{\theta} - \theta_0) = -\left( \left.\frac{\partial h_N(\theta)}{\partial \theta'}\right|_{\theta^+} \right)^{-1} \sqrt{N}\, h_N(\theta_0). \tag{5.28}$$


This leads to the following theorem. 


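Theorem 5.4 (Limit Distribution of Estimating Equations Estimator):
Assume that the estimating equations estimator that solves (5.26) is consistent
for $\theta_0$ and make the following assumptions:

(i) $\partial h_N(\theta)/\partial \theta'$ exists and is continuous in an open convex neighborhood of $\theta_0$.

(ii) $\partial h_N(\theta)/\partial \theta'|_{\theta^+}$ converges in probability to the finite nonsingular matrix

$$A_0 = \operatorname{plim} \left.\frac{\partial h_N(\theta)}{\partial \theta'}\right|_{\theta_0} = \operatorname{plim} \frac{1}{N} \sum_{i=1}^{N} \left.\frac{\partial h_i(\theta)}{\partial \theta'}\right|_{\theta_0}, \tag{5.29}$$

for any sequence $\theta^+$ such that $\theta^+ \xrightarrow{p} \theta_0$.

(iii) $\sqrt{N}\, h_N(\theta_0) \xrightarrow{d} \mathcal{N}[0, B_0]$, where

$$B_0 = \operatorname{plim} N h_N(\theta_0) h_N(\theta_0)' = \operatorname{plim} \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{N} h_i(\theta_0) h_j(\theta_0)'. \tag{5.30}$$

Then the limit distribution of the estimating equations estimator is

$$\sqrt{N}(\hat{\theta} - \theta_0) \xrightarrow{d} \mathcal{N}[0, A_0^{-1} B_0 A_0'^{-1}], \tag{5.31}$$

where, unlike for the extremum estimator, the matrix $A_0$ may not be symmetric
since it is no longer necessarily a Hessian matrix.
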
This theorem can be proved by adaptation of Amemiya's proof of Theorem 5.3.
Note that Theorem 5.4 assumes that consistency has already been established.

Godambe (1960) showed that for analysis conditional on regressors the most efficient
estimating equations estimator sets $h_i(\theta) = \partial \ln f(y_i|x_i, \theta)/\partial \theta$. Then (5.26) are
the first-order conditions for the ML estimator.


5.4.2. Analogy Principle 


The analogy principle uses population conditions to motivate estimators. The book 
by Manski (1988a) emphasizes the importance of the analogy principle as a unify- 
ing theme for estimation. Manski (1988a, p. xi) provides the following quote from 
Goldberger (1968, p. 4): 


The analogy principle of estimation ... proposes that population parameters be 
estimated by sample statistics which have the same property in the sample as the 
parameters do in the population. 


Analogue estimators are estimators obtained by application of the analogy prin- 
ciple. Population moment conditions suggest as estimator the solution to the corre- 
sponding sample moment condition. 

Extremum estimator examples of application of the analogy principle have been 
given in Section 4.2. For instance, if the goal of prediction is to minimize expected 
loss in the population and squared error loss is used, then the regression parameters $\beta$
are estimated by minimizing the sample sum of squared errors. 

Method of moments estimators are also examples. For instance, in the iid case if
$\mathrm{E}[y_i - \mu] = 0$ in the population then we use as estimator $\hat{\mu}$ that solves the corresponding
sample moment conditions $N^{-1} \sum_i (y_i - \mu) = 0$, leading to $\hat{\mu} = \bar{y}$, the sample
mean.

An estimating equations estimator may be motivated as an analogue estimator. If 
(5.27) holds in the population then estimate $\theta$ by solving the corresponding sample
moment condition (5.26). 
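
As a small sketch of the analogue estimator in practice, the following solves a scalar sample moment condition of the form (5.26) numerically. The function name `solve_estimating_equation`, the use of bisection, and the bracketing interval are assumptions made for illustration; the approach presumes the sample moment changes sign exactly once on the interval supplied.

```python
def solve_estimating_equation(h, data, lo, hi, tol=1e-10):
    """Find theta in [lo, hi] with N^-1 sum_i h(y_i, theta) = 0 by bisection.

    Assumes the sample average of h changes sign exactly once on [lo, hi].
    """
    def h_bar(theta):
        return sum(h(y, theta) for y in data) / len(data)

    f_lo, f_hi = h_bar(lo), h_bar(hi)
    if f_lo * f_hi > 0:
        raise ValueError("sample moment does not change sign on [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = h_bar(mid)
        if f_lo * f_mid <= 0:
            hi = mid                  # root lies in [lo, mid]
        else:
            lo, f_lo = mid, f_mid     # root lies in [mid, hi]
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    y = [2.0, 3.5, 1.0, 4.5, 4.0]     # made-up sample
    # h(y, mu) = y - mu is the sample analogue of E[y - mu] = 0
    mu_hat = solve_estimating_equation(lambda yi, mu: yi - mu, y, min(y), max(y))
    print(mu_hat, sum(y) / len(y))    # the two values agree up to the tolerance
```

With $h(y, \mu) = y - \mu$ the numerical solution reproduces the sample mean, as in the method of moments example above; other choices of $h$ only require changing the function passed in.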

Estimating equations estimators are extensively used in microeconometrics. The 
relevant theory can be subsumed within that for generalized method of moments, 
presented in the next chapter, which is an extension that permits there to be more 
moment conditions than parameters. In applied statistics the approach is used in the 
context of generalized estimating equations. 


5.5. Statistical Inference 


A detailed treatment of hypothesis tests and confidence intervals is given in Chapter 7. 
Here we outline how to test linear restrictions, including exclusion restrictions, using 
the most common method, the Wald test for estimators that may be nonlinear. Asymp- 
totic theory is used, so formal results lead to chi-square and normal distributions rather 
than the small sample F- and t-distributions from linear regression under normality.
Moreover, there are several ways to consistently estimate the variance matrix of an
extremum estimator, leading to alternative estimates of standard errors and associated
test statistics and p-values. 


5.5.1. Wald Hypothesis Tests of Linear Restrictions 
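
Consider testing $h$ linearly independent restrictions, say $H_0$ against $H_a$, where

$$\begin{aligned}
H_0 &: R\theta_0 - r = 0, \\
H_a &: R\theta_0 - r \neq 0,
\end{aligned}$$

with $R$ an $h \times q$ matrix of constants and $r$ an $h \times 1$ vector of constants. For example,
if $\theta = [\theta_1, \theta_2, \theta_3]'$ then to test whether $\theta_{10} - \theta_{20} = 2$, set $R = [1, -1, 0]$ and $r = 2$,
so that $R\theta_0 - r = \theta_{10} - \theta_{20} - 2$.

The Wald test rejects $H_0$ if $R\hat{\theta} - r$, the sample estimate of $R\theta_0 - r$, is significantly
different from 0. This requires knowledge of the distribution of $R\hat{\theta} - r$. Suppose
$\sqrt{N}(\hat{\theta} - \theta_0) \xrightarrow{d} \mathcal{N}[0, C_0]$, where $C_0 = A_0^{-1} B_0 A_0^{-1}$ from (5.20). Then

$$\hat{\theta} \overset{a}{\sim} \mathcal{N}[\theta_0, N^{-1} C_0],$$

so that under $H_0$ the linear combination

$$R\hat{\theta} - r \overset{a}{\sim} \mathcal{N}[0, R(N^{-1} C_0)R'],$$

where the mean is zero since $R\theta_0 - r = 0$ under $H_0$.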


Chi-Square Tests 


It is convenient to move from the multivariate normal distribution to the chi-square 
distribution by taking the quadratic form. This yields the Wald statistic 


$$W = (R\hat{\theta} - r)' \left[ R(N^{-1}\hat{C})R' \right]^{-1} (R\hat{\theta} - r) \xrightarrow{d} \chi^2(h) \tag{5.32}$$


under $H_0$, where $R(N^{-1} C_0)R'$ is of full rank $h$ under the assumption of linearly
independent restrictions, and $\hat{C}$ is a consistent estimator of $C_0$. Large values of $W$ lead to
rejection, and $H_0$ is rejected at level $\alpha$ if $W > \chi^2_\alpha(h)$ and is not rejected otherwise.

Practitioners frequently instead use the F-statistic $F = W/h$. Inference is then based
on the $F(h, N - q)$ distribution in the hope that this might provide a better finite sample
approximation. Note that $h$ times the $F(h, N)$ distribution converges to the $\chi^2(h)$
distribution as $N \to \infty$.

The replacement of $C_0$ by $\hat{C}$ in obtaining (5.32) makes no difference asymptotically,
but in finite samples different $\hat{C}$ will lead to different values of $W$. In the case of
classical linear regression this step corresponds to replacing $\sigma^2$ by $s^2$. Then $W/h$ is
exactly F distributed if the errors are normally distributed (see Section 7.2.1).
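
The quadratic form in (5.32) is straightforward to compute once an estimate of the variance matrix of the estimator is available. The sketch below is one possible implementation using only the Python standard library; the names `solve` and `wald_statistic`, the made-up estimates, and the diagonal variance matrix are assumptions for illustration. The argument `v_hat` stands for $N^{-1}\hat{C}$, and the statistic is compared with the usual 5% chi-square critical values (3.841 for one restriction, 5.991 for two).

```python
def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        z[i] = (m[i][n] - sum(m[i][j] * z[j] for j in range(i + 1, n))) / m[i][i]
    return z


def wald_statistic(theta_hat, v_hat, R, r):
    """W = (R theta - r)' [R V R']^{-1} (R theta - r), with V = N^{-1} C_hat."""
    h, q = len(R), len(theta_hat)
    d = [sum(R[i][k] * theta_hat[k] for k in range(q)) - r[i] for i in range(h)]
    rv = [[sum(R[i][k] * v_hat[k][l] for k in range(q)) for l in range(q)]
          for i in range(h)]
    rvr = [[sum(rv[i][l] * R[j][l] for l in range(q)) for j in range(h)]
           for i in range(h)]
    z = solve(rvr, d)                  # [R V R']^{-1} (R theta - r)
    return sum(d[i] * z[i] for i in range(h))


if __name__ == "__main__":
    theta_hat = [1.9, 0.2, 0.5]                                   # made-up estimates
    v_hat = [[0.04, 0.0, 0.0], [0.0, 0.01, 0.0], [0.0, 0.0, 0.09]]
    # One restriction, H0: theta_2 = theta_3
    print(wald_statistic(theta_hat, v_hat, [[0, 1, -1]], [0]) > 3.841)
    # Two exclusion restrictions, H0: theta_2 = 0 and theta_3 = 0
    print(wald_statistic(theta_hat, v_hat, [[0, 1, 0], [0, 0, 1]], [0, 0]) > 5.991)
```

With these illustrative numbers the first statistic is about 0.9, so equality of the two coefficients is not rejected at 5%, whereas the joint exclusion statistic is about 6.78 and exceeds 5.991.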


Tests of a Single Coefficient 


Often attention is focused on testing difference from zero of a single coefficient, say the
$j$th coefficient. Then $R\theta - r = \theta_j$ and $W = \hat{\theta}_j^2/(N^{-1}\hat{c}_{jj})$, where $\hat{c}_{jj}$ is the $j$th diagonal
element in $\hat{C}$. Taking the square root of $W$ yields

$$t = \frac{\hat{\theta}_j}{\mathrm{se}[\hat{\theta}_j]} \xrightarrow{d} \mathcal{N}[0, 1] \tag{5.33}$$

under $H_0$, where $\mathrm{se}[\hat{\theta}_j] = \sqrt{N^{-1}\hat{c}_{jj}}$ is the asymptotic standard error of $\hat{\theta}_j$. Large values
of $t$ lead to rejection, and unlike $W$ the statistic $t$ can be used for one-sided tests.

Formally $\sqrt{W}$ is an asymptotic z-statistic, but we use the notation $t$ as it yields
the usual “t-statistic,” the estimate divided by its standard error. In finite samples,
some statistical packages use the standard normal distribution whereas others use the 
t-distribution to compute critical values, p-values, and confidence intervals. Neither is 
exactly correct in finite samples, except in the very special case of linear regression 
with errors assumed to be normally distributed, in which case the t-distribution is 
exact. Both lead to the same results in infinitely large samples as the t-distribution 
then collapses to the standard normal. 


5.5.2. Variance Matrix Estimation 


There are many possible ways to estimate $A_0^{-1} B_0 A_0^{-1}$, because there are many ways to
consistently estimate $A_0$ and $B_0$. Thus different econometrics programs should give the
same coefficient estimates but, quite reasonably, can give standard errors, t-statistics, 
and p-values that differ in finite samples. It is up to the practitioner to determine the 
method used and the strength of the associated distributional assumptions on the dgp. 


Sandwich Estimate of the Variance Matrix 


The limit distribution of $\sqrt{N}(\hat{\theta} - \theta_0)$ has variance matrix $A_0^{-1} B_0 A_0^{-1}$. It follows that
$\hat{\theta}$ has asymptotic variance matrix $N^{-1} A_0^{-1} B_0 A_0^{-1}$, where division by $N$ arises because
we are considering $\hat{\theta}$ rather than $\sqrt{N}(\hat{\theta} - \theta_0)$.

A sandwich estimate of the asymptotic variance of $\hat{\theta}$ is any estimate of the form

$$\hat{\mathrm{V}}[\hat{\theta}] = N^{-1} \hat{A}^{-1} \hat{B} \hat{A}'^{-1}, \tag{5.34}$$

where $\hat{A}$ is consistent for $A_0$ and $\hat{B}$ is consistent for $B_0$. This is called the sandwich
form since $\hat{B}$ is sandwiched between $\hat{A}^{-1}$ and $\hat{A}'^{-1}$. For many estimators $\hat{A}$ is a
Hessian matrix so $\hat{A}^{-1}$ is symmetric, but this need not always be the case.

A robust sandwich estimate is a sandwich estimate where the estimate $\hat{B}$ is consistent
for $B_0$ under relatively weak assumptions. It leads to what are termed robust
standard errors. A leading example is White’s heteroskedastic-consistent estimate of 
the variance matrix of the OLS estimator (see Section 4.4.5). In various specific con- 
texts, detailed in later sections, robust sandwich estimates are called Huber estimates, 
after Huber (1967); Eicker—White estimates, after Eicker (1967) and White (1980a,b, 
1982); and in stationary time-series applications Newey—West estimates, after Newey 
and West (1987b). 
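

Estimation of A and B


Here we present different estimators for $A_0$ and $B_0$ for both the estimating equations
estimator that solves $h_N(\hat{\theta}) = 0$ and the local extremum estimator that solves
$\partial Q_N(\theta)/\partial \theta|_{\hat{\theta}} = 0$.

Two standard estimates of $A_0$ in (5.29) and (5.18) are the Hessian estimate

$$\hat{A}_H = \left.\frac{\partial h_N(\theta)}{\partial \theta'}\right|_{\hat{\theta}} = \left.\frac{\partial^2 Q_N(\theta)}{\partial \theta \partial \theta'}\right|_{\hat{\theta}}, \tag{5.35}$$

where the second equality explains the use of the term Hessian, and the expected
Hessian estimate

$$\hat{A}_{EH} = \left.\mathrm{E}\left[ \frac{\partial h_N(\theta)}{\partial \theta'} \right]\right|_{\hat{\theta}} = \left.\mathrm{E}\left[ \frac{\partial^2 Q_N(\theta)}{\partial \theta \partial \theta'} \right]\right|_{\hat{\theta}}. \tag{5.36}$$

The first is analytically simpler and potentially relies on fewer distributional
assumptions; the latter is more likely to be negative definite and invertible.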

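For $B_0$ in (5.30) or (5.19) it is not possible to use the obvious estimate
$N h_N(\hat{\theta}) h_N(\hat{\theta})'$, since this equals zero as $\hat{\theta}$ is defined to satisfy $h_N(\hat{\theta}) = 0$. One
estimate is to make potentially strong distributional assumptions to get

$$\hat{B}_E = \left.\mathrm{E}\left[ N h_N(\theta) h_N(\theta)' \right]\right|_{\hat{\theta}} = \left.\mathrm{E}\left[ N \frac{\partial Q_N(\theta)}{\partial \theta} \frac{\partial Q_N(\theta)}{\partial \theta'} \right]\right|_{\hat{\theta}}. \tag{5.37}$$

Weaker assumptions are possible for m-estimators and estimating equations estimators
with data independent over $i$. Then (5.30) simplifies to

$$B_0 = \mathrm{E}\left[ \frac{1}{N} \sum_{i=1}^{N} h_i(\theta) h_i(\theta)' \right],$$

since independence implies that, for $i \neq j$, $\mathrm{E}[h_i h_j'] = \mathrm{E}[h_i]\mathrm{E}[h_j']$, which in turn
equals zero given $\mathrm{E}[h_i(\theta)] = 0$. This leads to the outer product (OP) estimate or
BHHH estimate (after Berndt, Hall, Hall, and Hausman, 1974)

$$\hat{B}_{OP} = \frac{1}{N} \sum_{i=1}^{N} h_i(\hat{\theta}) h_i(\hat{\theta})' = \frac{1}{N} \sum_{i=1}^{N} \left.\frac{\partial q_i(\theta)}{\partial \theta}\right|_{\hat{\theta}} \left.\frac{\partial q_i(\theta)}{\partial \theta'}\right|_{\hat{\theta}}. \tag{5.38}$$

$\hat{B}_{OP}$ requires fewer assumptions than $\hat{B}_E$.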

In practice a degrees of freedom adjustment is often used in estimating $B_0$, with
division in (5.38) for $\hat{B}_{OP}$ by $(N - q)$ rather than $N$, and similar multiplication of $\hat{B}_E$
in (5.37) by $N/(N - q)$. There is no theoretical justification for this adjustment in
nonlinear models, but in some simulation studies this adjustment leads to better
finite-sample performance and it does coincide with the degrees of freedom adjustment made
for OLS with homoskedastic errors. No similar adjustment is made for $\hat{A}_H$ or $\hat{A}_{EH}$.

Simplification occurs in some special cases with $A_0 = -B_0$. Leading examples are
OLS or NLS with homoskedastic errors (see Section 5.8.3) and maximum likelihood
with correctly specified distribution (see Section 5.6.4). Then either $-\hat{A}^{-1}$ or $\hat{B}^{-1}$ may
be used to estimate the variance of $\sqrt{N}(\hat{\theta} - \theta_0)$. These estimates are less robust to
misspecification of the dgp than those using the sandwich form. Misspecification of
the dgp, however, may additionally lead to inconsistency of $\hat{\theta}$, in which case even
inference based on the robust sandwich estimate will be invalid.

For the Poisson example of Section 5.2, $\hat{A}_H = \hat{A}_{EH} = -N^{-1} \sum_i \exp(x_i'\hat{\beta}) x_i x_i'$ and
$\hat{B}_{OP} = (N - q)^{-1} \sum_i (y_i - \exp(x_i'\hat{\beta}))^2 x_i x_i'$. If $\mathrm{V}[y|x] = \exp(x'\beta_0)$, the case if $y|x$ is
actually Poisson distributed, then $\hat{B}_E = -[N/(N - q)]\hat{A}_{EH}$ and simplification occurs.
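
These Poisson formulas can be sketched in code for the scalar-regressor case, where $\hat{A}_H$ and $\hat{B}_{OP}$ are scalars. The function names `fit_poisson` and `poisson_standard_errors`, the simulated overdispersed dgp, and the sample size are assumptions for illustration only. The sketch returns both the robust sandwich standard error from (5.34) and the simpler standard error based on $-\hat{A}_H^{-1}$, which is appropriate only if $A_0 = -B_0$.

```python
import math
import random


def fit_poisson(data, beta=0.0, tol=1e-10, max_iter=100):
    """Newton-Raphson for the scalar-regressor Poisson first-order condition."""
    n = len(data)
    for _ in range(max_iter):
        score = sum((y - math.exp(x * beta)) * x for y, x in data) / n
        hessian = -sum(math.exp(x * beta) * x * x for _, x in data) / n
        step = -score / hessian
        beta += step
        if abs(step) < tol:
            break
    return beta


def poisson_standard_errors(data, beta_hat, q=1):
    """Return (robust sandwich se, se based on -A^{-1}) for scalar beta."""
    n = len(data)
    # Hessian estimate A_H = -N^-1 sum_i exp(x_i b) x_i^2
    a_h = -sum(math.exp(x * beta_hat) * x * x for _, x in data) / n
    # Outer product estimate B_OP with the (N - q) degrees of freedom adjustment
    b_op = sum(((y - math.exp(x * beta_hat)) * x) ** 2 for y, x in data) / (n - q)
    se_robust = math.sqrt(b_op / (a_h * a_h) / n)   # N^-1 A^-1 B A^-1
    se_ml = math.sqrt(-1.0 / a_h / n)               # N^-1 (-A^-1), valid if A0 = -B0
    return se_robust, se_ml


if __name__ == "__main__":
    rng = random.Random(1)
    sample = []
    for _ in range(20000):
        x = rng.uniform(-1.0, 1.0)
        # Hypothetical overdispersed dgp: E[y|x] = exp(x), V[y|x] = exp(2x)
        sample.append((math.exp(x) * rng.expovariate(1.0), x))
    b = fit_poisson(sample)
    print(b, poisson_standard_errors(sample, b))
```

Because the assumed conditional variance differs from the conditional mean in this dgp, the two standard errors should differ, with the sandwich form being the one that remains valid.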


5.6. Maximum Likelihood 


The ML estimator holds a special place among estimators. It is the most efficient estimator
among consistent asymptotically normal estimators. It is also important pedagog-
ically, as many methods for nonlinear regression such as m-estimation can be viewed 
as extensions and adaptations of results first obtained for ML estimation. 


5.6.1. Likelihood Function 
The Likelihood Principle 


The likelihood principle, due to R. A. Fisher (1922), is to choose as estimator of the
parameter vector $\theta_0$ that value of $\theta$ that maximizes the likelihood of observing the actual
sample. In the discrete case this likelihood is the probability obtained from the
probability mass function; in the continuous case this is the density. Consider the discrete
case. If one value of $\theta$ implies that the probability of the observed data occurring
is .0012, whereas a second value of $\theta$ gives a higher probability of .0014, then the
second value of $\theta$ is a better estimator.

The joint probability mass function or density $f(y, X|\theta)$ is viewed here as a function
of $\theta$ given the data $(y, X)$. This is called the likelihood function and is denoted
by $L_N(\theta|y, X)$. Maximizing $L_N(\theta)$ is equivalent to maximizing the log-likelihood
function

$$\mathcal{L}_N(\theta) = \ln L_N(\theta).$$


We take the natural logarithm because in application this leads to an objective function 
that is the sum rather than the product of N terms. 


Conditional Likelihood 


The likelihood function $L_N(\theta) = f(y, X|\theta) = f(y|X, \theta) f(X|\theta)$ requires specification
of both the conditional density of $y$ given $X$ and the marginal density of $X$.

Instead, estimation is usually based on the conditional likelihood function
$L_N(\theta) = f(y|X, \theta)$, since the goal of regression is to model the behavior of $y$ given
$X$. This is not a restriction if $f(y|X)$ and $f(X)$ depend on mutually exclusive sets
of parameters. When this is the case it is common terminology to drop the adjective
conditional. For rare exceptions such as endogenous sampling (see Chapters 3 and
24) consistent estimation requires that estimation is based on the full joint density
$f(y, X|\theta)$ rather than the conditional density $f(y|X, \theta)$.

Table 5.3. Maximum Likelihood: Commonly Used Densities


Model         Range of y            Density f(y)                                        Common Parameterization
Normal        $(-\infty, \infty)$   $[2\pi\sigma^2]^{-1/2} e^{-(y-\mu)^2/2\sigma^2}$    $\mu = x'\beta$, $\sigma^2 = \sigma^2$
Bernoulli     0 or 1                $p^y (1-p)^{1-y}$                                   Logit $p = e^{x'\beta}/(1 + e^{x'\beta})$
Exponential   $(0, \infty)$         $\lambda e^{-\lambda y}$                            $\lambda = e^{x'\beta}$ or $1/\lambda = e^{x'\beta}$
Poisson       0, 1, 2, ...          $e^{-\lambda} \lambda^y / y!$                       $\lambda = e^{x'\beta}$


For cross-section data the observations $(y_i, x_i)$ are independent over $i$ with conditional
density function $f(y_i|x_i, \theta)$. Then by independence the joint conditional density
$f(y|X, \theta) = \prod_{i=1}^{N} f(y_i|x_i, \theta)$, leading to the (conditional) log-likelihood function

$$Q_N(\theta) = N^{-1} \mathcal{L}_N(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ln f(y_i|x_i, \theta), \tag{5.39}$$

where we divide by $N$ so that the objective function is an average.

Results extend to multivariate data, systems of equations, and panel data by replacing
the scalar $y_i$ by vector $\mathbf{y}_i$ and letting $f(\mathbf{y}_i|x_i, \theta)$ be the joint density of $\mathbf{y}_i$
conditional on $x_i$. See also Section 5.7.5.


Examples 


Across a wide range of data types the following method is used to generate fully 
parametric cross-section regression models. First choose the one-parameter or two- 
parameter (or in some rare cases three-parameter) distribution that would be used for 
the dependent variable $y$ in the iid case studied in a basic statistics course. Then
parameterize the one or two underlying parameters in terms of regressors $x$ and
parameters $\theta$.

Some commonly used distributions and parameterizations are given in Table 5.3. 
Additional distributions are given in Appendix B, which also presents methods to draw 
pseudo-random variates. 

For continuous data on $(-\infty, \infty)$, the normal is the standard distribution. The classical
linear regression model sets $\mu = x'\beta$ and assumes $\sigma^2$ is constant.

For discrete binary data taking values 0 or 1, the density is always the Bernoulli,
a special case of the binomial with one trial. The usual parameterizations for the
Bernoulli probability lead to the logit model, given in Table 5.3, and the probit model
with $p = \Phi(\mathbf{x}'\boldsymbol{\beta})$, where $\Phi(\cdot)$ is the standard normal cumulative distribution function.
These models are analyzed in Chapter 14.

For positive continuous data on $(0, \infty)$, notably duration data considered in
Chapters 17-19, the richer Weibull, gamma, and log-normal models are often used in
addition to the exponential given in Table 5.3.

For integer-valued count data taking values 0, 1, 2, ... (see Chapter 20) the richer
negative binomial is often used in addition to the Poisson presented in Section 5.2.1.
Setting $\lambda = \exp(\mathbf{x}'\boldsymbol{\beta})$ ensures a positive conditional mean.

For incompletely observed data, censored or truncated variants of these distributions
may be used. The most common example is the censored normal, which is called the
Tobit model and is presented in Section 16.3.

Standard likelihood-based models are rarely specified by making assumptions on
the distribution of an error term. They are instead defined directly in terms of the
distribution of the dependent variable. In the special case that $y \sim \mathcal{N}[\mathbf{x}'\boldsymbol{\beta}, \sigma^2]$ we can
equivalently define $y = \mathbf{x}'\boldsymbol{\beta} + u$, where the error term $u \sim \mathcal{N}[0, \sigma^2]$. However, this
relies on an additive property of the normal shared by few other distributions. For
example, if $y$ is Poisson distributed with mean $\exp(\mathbf{x}'\boldsymbol{\beta})$ we can always write
$y = \exp(\mathbf{x}'\boldsymbol{\beta}) + u$, but the error $u$ no longer has a familiar distribution.
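
The parameterizations in Table 5.3 and the averaged log-likelihood (5.39) can be sketched in a few lines of Python. This is only an illustration: the function names `log_density` and `avg_loglik` are hypothetical, and a single scalar regressor is assumed so that $\mathbf{x}'\boldsymbol{\beta}$ reduces to `x * beta`.

```python
import math

def log_density(model, y, xb, sigma2=1.0):
    """ln f(y | x, beta) for the Table 5.3 parameterizations; xb stands for x'beta."""
    if model == "normal":        # mu = x'beta with constant sigma^2
        return -0.5 * math.log(2 * math.pi * sigma2) - (y - xb) ** 2 / (2 * sigma2)
    if model == "bernoulli":     # logit: p = e^{xb} / (1 + e^{xb})
        p = 1.0 / (1.0 + math.exp(-xb))
        return y * math.log(p) + (1 - y) * math.log(1 - p)
    if model == "exponential":   # lambda = e^{xb}
        lam = math.exp(xb)
        return math.log(lam) - lam * y
    if model == "poisson":       # lambda = e^{xb}, so ln(lambda^y) = y * xb
        return -math.exp(xb) + y * xb - math.lgamma(y + 1)
    raise ValueError(f"unknown model: {model}")

def avg_loglik(model, beta, y, x):
    """Q_N(beta) as in (5.39), assuming a scalar regressor x_i (a simplification)."""
    return sum(log_density(model, yi, xi * beta) for yi, xi in zip(y, x)) / len(y)
```

Writing the model directly as a density for $y$, rather than as a distribution for an error term, mirrors the way the likelihood is specified in the text.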


5.6.2. Maximum Likelihood Estimator 


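The maximum likelihood estimator (MLE) is the estimator that maximizes the
(conditional) log-likelihood function and is clearly an extremum estimator. Usually the
MLE is the local maximum that solves the first-order conditions

$$ \frac{1}{N}\frac{\partial \mathcal{L}_N(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = \frac{1}{N}\sum_{i=1}^{N} \frac{\partial \ln f(y_i|\mathbf{x}_i, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = \mathbf{0}. \tag{5.40} $$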


More formally this estimator is the conditional MLE, as it is based on the conditional 
density of y given x, but it is common practice to use the simpler term MLE. 

The gradient vector $\partial \mathcal{L}_N(\boldsymbol{\theta})/\partial \boldsymbol{\theta}$ is called the score vector, as it sums the first
derivatives of the log density, and when evaluated at $\boldsymbol{\theta}_0$ it is called the efficient score.
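
As a minimal sketch of solving the first-order conditions (5.40), the following code applies Newton iterations to the Poisson score with $\lambda_i = \exp(x_i\beta)$. The scalar regressor, the function names, and the use of Newton's method are assumptions made for illustration; iterative methods are treated in Chapter 10.

```python
import math

def poisson_score(beta, y, x):
    """Averaged score: N^{-1} sum_i (y_i - exp(x_i*beta)) * x_i."""
    return sum((yi - math.exp(xi * beta)) * xi for yi, xi in zip(y, x)) / len(y)

def poisson_hessian(beta, x):
    """Averaged second derivative: -N^{-1} sum_i exp(x_i*beta) * x_i^2."""
    return -sum(math.exp(xi * beta) * xi * xi for xi in x) / len(x)

def poisson_mle(y, x, beta=0.0, tol=1e-12, max_iter=100):
    """Newton iterations beta <- beta - score/hessian until the step is negligible."""
    for _ in range(max_iter):
        step = poisson_score(beta, y, x) / poisson_hessian(beta, x)
        beta -= step
        if abs(step) < tol:
            break
    return beta

# With an intercept only (x_i = 1) the MLE is ln(sample mean of y).
beta_hat = poisson_mle([0, 1, 3, 2], [1.0, 1.0, 1.0, 1.0])
```

In the intercept-only case the solution has the closed form $\hat{\beta} = \ln \bar{y}$, which gives a simple check on the iterations.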


5.6.3. Information Matrix Equality 


The results of Section 5.3 simplify for the MLE, provided the density is correctly
specified and is one for which the range of $y$ does not depend on $\boldsymbol{\theta}$.


Regularity Conditions 


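The ML regularity conditions are that

$$ \mathrm{E}_f\left[\frac{\partial \ln f(y|\mathbf{x}, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right] = \int \frac{\partial \ln f(y|\mathbf{x}, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}} f(y|\mathbf{x}, \boldsymbol{\theta})\,dy = \mathbf{0} \tag{5.41} $$

and

$$ -\mathrm{E}_f\left[\frac{\partial^2 \ln f(y|\mathbf{x}, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}\,\partial \boldsymbol{\theta}'}\right] = \mathrm{E}_f\left[\frac{\partial \ln f(y|\mathbf{x}, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\frac{\partial \ln f(y|\mathbf{x}, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}'}\right], \tag{5.42} $$

where the notation $\mathrm{E}_f[\cdot]$ is used to make explicit that the expectation is with respect to
the specified density $f(y|\mathbf{x}, \boldsymbol{\theta})$. Result (5.41) implies that the score vector has expected
value zero, and (5.42) yields (5.44).

The derivation given in Section 5.6.7 requires that the range of $y$ does not depend on $\boldsymbol{\theta}$
so that integration and differentiation can be interchanged.


Information Matrix Equality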


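The information matrix is the expectation of the outer product of the score vector,

$$ \mathcal{I} = \mathrm{E}\left[\frac{\partial \mathcal{L}_N(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\frac{\partial \mathcal{L}_N(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}'}\right]. \tag{5.43} $$

The terminology information matrix is used as $\mathcal{I}$ is the variance of $\partial \mathcal{L}_N(\boldsymbol{\theta})/\partial \boldsymbol{\theta}$, since
by (5.41) $\partial \mathcal{L}_N(\boldsymbol{\theta})/\partial \boldsymbol{\theta}$ has mean zero. Then large values of $\mathcal{I}$ mean that small changes
in $\boldsymbol{\theta}$ lead to large changes in the log-likelihood, which accordingly contains
considerable information about $\boldsymbol{\theta}$. The quantity $\mathcal{I}$ is more precisely called Fisher
Information, as there are alternative information measures.

For log-likelihood function (5.39), the regularity condition (5.42) implies that

$$ -\mathrm{E}_f\left[\left.\frac{\partial^2 \mathcal{L}_N(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}\,\partial \boldsymbol{\theta}'}\right|_{\boldsymbol{\theta}_0}\right] = \mathrm{E}_f\left[\left.\frac{\partial \mathcal{L}_N(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\frac{\partial \mathcal{L}_N(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}'}\right|_{\boldsymbol{\theta}_0}\right], \tag{5.44} $$

if the expectation is with respect to $f(y|\mathbf{x}, \boldsymbol{\theta}_0)$. The relationship (5.44) is called the
information matrix (IM) equality and implies that the information matrix also equals
$-\mathrm{E}[\partial^2 \mathcal{L}_N(\boldsymbol{\theta})/\partial \boldsymbol{\theta}\,\partial \boldsymbol{\theta}']$. The IM equality (5.44) implies that $-\mathbf{A}_0 = \mathbf{B}_0$, where $\mathbf{A}_0$ and
$\mathbf{B}_0$ are defined in (5.18) and (5.19). Theorem 5.3 then simplifies since
$\mathbf{A}_0^{-1}\mathbf{B}_0\mathbf{A}_0^{-1} = -\mathbf{A}_0^{-1} = \mathbf{B}_0^{-1}$.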
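
The IM equality can be checked numerically in a simple case. The sketch below assumes an intercept-only Poisson model with $\lambda = e^{\beta}$, for which the score is $y - \lambda$ and the second derivative is $-\lambda$; expectations are computed by summing over the Poisson probabilities up to a truncation point `y_max`. The function names and the truncation are illustrative choices.

```python
import math

def poisson_pmf(y, lam):
    """Poisson probability e^{-lam} lam^y / y!, computed on the log scale."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def im_equality_terms(beta, y_max=200):
    """Return (E[score], E[score^2], -E[hessian]) for ln f = -e^beta + y*beta - ln y!."""
    lam = math.exp(beta)
    expected_score = sum((y - lam) * poisson_pmf(y, lam) for y in range(y_max + 1))
    outer_product = sum((y - lam) ** 2 * poisson_pmf(y, lam) for y in range(y_max + 1))
    minus_expected_hessian = lam   # the second derivative -e^beta does not depend on y
    return expected_score, outer_product, minus_expected_hessian

score_mean, b_term, minus_a_term = im_equality_terms(0.7)
```

Here `score_mean` is zero up to rounding, as in (5.41), and `b_term` matches `minus_a_term`, as in (5.44), because the expectation is taken under the same density used to form the score.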

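The equality (5.42) is in turn a special case of the generalized information matrix
equality

$$ \mathrm{E}_f\left[\frac{\partial \mathbf{m}(y, \boldsymbol{\theta})}{\partial \boldsymbol{\theta}'}\right] = -\mathrm{E}_f\left[\mathbf{m}(y, \boldsymbol{\theta})\frac{\partial \ln f(y|\boldsymbol{\theta})}{\partial \boldsymbol{\theta}'}\right], \tag{5.45} $$

where $\mathbf{m}(\cdot)$ is a vector moment function with $\mathrm{E}_f[\mathbf{m}(y, \boldsymbol{\theta})] = \mathbf{0}$ and expectations are
with respect to the density $f(y|\boldsymbol{\theta})$. This result, also obtained in Section 5.6.7, is used
in Chapters 7 and 8 to obtain simpler forms of some test statistics.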


5.6.4. Distribution of the ML Estimator 


The regularity conditions (5.41) and (5.42) lead to simplification of the general results
of Section 5.3.

The essential consistency condition (5.25) is that $\mathrm{E}[\partial \ln f(y|\mathbf{x}, \boldsymbol{\theta})/\partial \boldsymbol{\theta}|_{\boldsymbol{\theta}_0}] = \mathbf{0}$. This
holds by the regularity condition (5.41), provided the expectation is with respect to
$f(y|\mathbf{x}, \boldsymbol{\theta}_0)$. Thus if the dgp is $f(y|\mathbf{x}, \boldsymbol{\theta}_0)$, that is, the density has been correctly
specified, the MLE is consistent for $\boldsymbol{\theta}_0$.

For the asymptotic distribution, simplification occurs since $-\mathbf{A}_0 = \mathbf{B}_0$ by the IM
equality, which again assumes that the density is correctly specified.

These results can be collected into the following proposition. 


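Proposition 5.5 (Distribution of ML Estimator): Make the following
assumptions:


(i) The dgp is the conditional density $f(y_i|\mathbf{x}_i, \boldsymbol{\theta}_0)$ used to define the likelihood
function.
(ii) The density function $f(\cdot)$ satisfies $f(y, \boldsymbol{\theta}^{(1)}) = f(y, \boldsymbol{\theta}^{(2)})$ iff $\boldsymbol{\theta}^{(1)} = \boldsymbol{\theta}^{(2)}$.
(iii) The matrix

$$ \mathbf{A}_0 = \operatorname{plim} \frac{1}{N}\left.\frac{\partial^2 \mathcal{L}_N(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}\,\partial \boldsymbol{\theta}'}\right|_{\boldsymbol{\theta}_0} \tag{5.46} $$

exists and is finite nonsingular.
(iv) The order of differentiation and integration of the log-likelihood can be
reversed.


Then the ML estimator $\hat{\boldsymbol{\theta}}_{ML}$, defined to be a solution of the first-order conditions
$\partial N^{-1}\mathcal{L}_N(\boldsymbol{\theta})/\partial \boldsymbol{\theta} = \mathbf{0}$, is consistent for $\boldsymbol{\theta}_0$, and

$$ \sqrt{N}(\hat{\boldsymbol{\theta}}_{ML} - \boldsymbol{\theta}_0) \xrightarrow{d} \mathcal{N}[\mathbf{0}, -\mathbf{A}_0^{-1}]. \tag{5.47} $$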


Condition (i) states that the conditional density is correctly specified; conditions
(i) and (ii) ensure that $\boldsymbol{\theta}_0$ is identified; condition (iii) is analogous to the assumption
on $\operatorname{plim} N^{-1}\mathbf{X}'\mathbf{X}$ in the case of OLS estimation; and condition (iv) is necessary for the
regularity conditions to hold. As in the general case probability limits and expectations
are with respect to the dgp for $(\mathbf{y}, \mathbf{X})$, or with respect to just $\mathbf{y}$ if regressors are assumed
to be nonstochastic or analysis is conditional on $\mathbf{X}$.

Relaxation of condition (i) is considered in detail in Section 5.7. Most ML examples
satisfy condition (iv), but it does rule out some models such as $y$ uniformly distributed
on the interval $[0, \theta]$ since in this case the range of $y$ varies with $\theta$. Then not only
does $\mathbf{A}_0 \neq -\mathbf{B}_0$ but the global MLE converges at a rate other than $\sqrt{N}$ and has limit
distribution that is nonnormal. See, for example, Hirano and Porter (2003).
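
A small simulation illustrates the nonstandard rate in the uniform example, where the MLE of $\theta$ is the sample maximum. The sketch below, with hypothetical function name, seed, and replication count, averages $N(\theta - \hat{\theta})$ over replications; the mean of this quantity is $\theta N/(N+1)$, so it stays near $\theta$ as $N$ grows, indicating convergence at rate $N$ rather than $\sqrt{N}$.

```python
import random

def scaled_error_mean(n, theta=1.0, reps=500, seed=0):
    """Average of N * (theta - max_i y_i) over replications, with y_i ~ Uniform[0, theta]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        theta_hat = max(rng.uniform(0.0, theta) for _ in range(n))   # MLE is the sample maximum
        total += n * (theta - theta_hat)
    return total / reps

m100 = scaled_error_mean(100)     # sample size N = 100
m1000 = scaled_error_mean(1000)   # sample size N = 1000
```

Had the error been scaled by $\sqrt{N}$ instead, the averages would shrink toward zero as $N$ increases.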

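Given Proposition 5.5, the resulting asymptotic distribution of the MLE is often
expressed as

$$ \hat{\boldsymbol{\theta}}_{ML} \overset{a}{\sim} \mathcal{N}\left[\boldsymbol{\theta}, -\left(\mathrm{E}\left[\frac{\partial^2 \mathcal{L}_N(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}\,\partial \boldsymbol{\theta}'}\right]\right)^{-1}\right], \tag{5.48} $$

where for notational simplicity the evaluation at $\boldsymbol{\theta}_0$ is suppressed and we assume that
an LLN applies so that the plim operator in the definition of $\mathbf{A}_0$ is replaced by $\lim \mathrm{E}$
and then drop the limit. This notation is often used in later chapters.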

The right-hand side of (5.48) is the Cramer-Rao lower bound (CRLB), which from
basic statistics courses is the lower bound of the variance of unbiased estimators in
small samples. For large samples, considered here, the CRLB is the lower bound for
the variance matrix of consistent asymptotically normal (CAN) estimators with
convergence to normality of $\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)$ uniform in compact intervals of $\boldsymbol{\theta}_0$ (see Rao,
1973, pp. 344-351). Loosely speaking the MLE has the strong attraction of having
the smallest asymptotic variance among root-$N$ consistent estimators. This result
requires the strong assumption of correct specification of the conditional density.


5.6.5. Weibull Regression Example 


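As an example, consider regression based on the Weibull distribution, which is used to
model duration data such as length of unemployment spell (see Chapter 17).

The density for the Weibull distribution is $f(y) = \gamma\alpha y^{\alpha-1}\exp(-\gamma y^{\alpha})$, where $y > 0$
and the parameters $\alpha > 0$ and $\gamma > 0$. It can be shown that $\mathrm{E}[y] = \gamma^{-1/\alpha}\Gamma(\alpha^{-1} + 1)$,
where $\Gamma(\cdot)$ is the gamma function. The standard Weibull regression model is obtained
by specifying $\gamma = \exp(\mathbf{x}'\boldsymbol{\beta})$, in which case $\mathrm{E}[y|\mathbf{x}] = \exp(-\mathbf{x}'\boldsymbol{\beta}/\alpha)\Gamma(\alpha^{-1} + 1)$. Given
independence over $i$ the log-likelihood function is

$$ N^{-1}\mathcal{L}_N(\boldsymbol{\theta}) = N^{-1}\sum_i \{\mathbf{x}_i'\boldsymbol{\beta} + \ln\alpha + (\alpha - 1)\ln y_i - \exp(\mathbf{x}_i'\boldsymbol{\beta})y_i^{\alpha}\}. $$

Differentiation with respect to $\boldsymbol{\beta}$ and $\alpha$ leads to the first-order conditions

$$ N^{-1}\sum_i \{1 - \exp(\mathbf{x}_i'\boldsymbol{\beta})y_i^{\alpha}\}\mathbf{x}_i = \mathbf{0}, $$
$$ N^{-1}\sum_i \left\{\frac{1}{\alpha} + \ln y_i - \exp(\mathbf{x}_i'\boldsymbol{\beta})y_i^{\alpha}\ln y_i\right\} = 0. $$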
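
The first-order conditions above can be checked against numerical derivatives of the log density. The following sketch assumes a single observation and a scalar regressor; the function names and the central-difference step `h` are illustrative.

```python
import math

def weibull_loglik_i(beta, alpha, y, x):
    """ln f for one observation: x*beta + ln(alpha) + (alpha-1) ln y - exp(x*beta) y^alpha."""
    return x * beta + math.log(alpha) + (alpha - 1) * math.log(y) - math.exp(x * beta) * y ** alpha

def weibull_score_i(beta, alpha, y, x):
    """Analytic derivatives with respect to beta and alpha, as in the first-order conditions."""
    e = math.exp(x * beta) * y ** alpha
    return (1 - e) * x, 1 / alpha + math.log(y) - e * math.log(y)

def numeric_score_i(beta, alpha, y, x, h=1e-6):
    """Central-difference approximation to the same two derivatives."""
    d_beta = (weibull_loglik_i(beta + h, alpha, y, x) - weibull_loglik_i(beta - h, alpha, y, x)) / (2 * h)
    d_alpha = (weibull_loglik_i(beta, alpha + h, y, x) - weibull_loglik_i(beta, alpha - h, y, x)) / (2 * h)
    return d_beta, d_alpha

analytic = weibull_score_i(0.3, 1.5, 2.0, 0.8)
numeric = numeric_score_i(0.3, 1.5, 2.0, 0.8)
```

Agreement of `analytic` and `numeric` to several decimal places is a routine safeguard before passing an analytic score to an optimizer.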


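Unlike the Poisson example, consistency essentially requires correct specification
of the distribution. To see this, consider the first-order conditions for $\boldsymbol{\beta}$. The informal
condition (5.25) that $\mathrm{E}[\{1 - \exp(\mathbf{x}'\boldsymbol{\beta})y^{\alpha}\}\mathbf{x}] = \mathbf{0}$ requires that $\mathrm{E}[y^{\alpha}|\mathbf{x}] = \exp(-\mathbf{x}'\boldsymbol{\beta})$,
where the power $\alpha$ is not restricted to be an integer. The first-order conditions for $\alpha$
lead to an even more esoteric moment condition on $y$.

So we need to proceed on the assumption that the density is indeed Weibull with
$\gamma = \exp(\mathbf{x}'\boldsymbol{\beta}_0)$ and $\alpha = \alpha_0$. Proposition 5.5 can be applied as the range of $y$ does not
depend on the parameters. Then, from (5.48), the Weibull MLE is asymptotically normal
with asymptotic variance

$$ \mathrm{V}\begin{bmatrix}\hat{\boldsymbol{\beta}} \\ \hat{\alpha}\end{bmatrix} = \left(-\mathrm{E}\begin{bmatrix}\sum_i -e^{\mathbf{x}_i'\boldsymbol{\beta}_0}y_i^{\alpha_0}\mathbf{x}_i\mathbf{x}_i' & \sum_i -e^{\mathbf{x}_i'\boldsymbol{\beta}_0}y_i^{\alpha_0}\ln(y_i)\mathbf{x}_i \\ \sum_i -e^{\mathbf{x}_i'\boldsymbol{\beta}_0}y_i^{\alpha_0}\ln(y_i)\mathbf{x}_i' & \sum_i d_i\end{bmatrix}\right)^{-1}, \tag{5.49} $$

where $d_i = -(1/\alpha_0^2) - e^{\mathbf{x}_i'\boldsymbol{\beta}_0}y_i^{\alpha_0}(\ln y_i)^2$. The matrix inverse in (5.49) needs to be
obtained by partitioned inversion because the off-diagonal term $\partial^2\mathcal{L}_N(\boldsymbol{\beta}, \alpha)/\partial\boldsymbol{\beta}\,\partial\alpha$ does
not have expected value zero. Simplification occurs in models with zero expected
cross-derivative $\mathrm{E}[\partial^2\mathcal{L}_N(\boldsymbol{\beta}, \alpha)/\partial\boldsymbol{\beta}\,\partial\alpha'] = \mathbf{0}$, such as regression with normally
distributed errors, in which case the information matrix is said to be block diagonal
in $\boldsymbol{\beta}$ and $\alpha$.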


5.6.6. Variance Matrix Estimation for MLE 


There are several ways to consistently estimate the variance matrix of an extremum
estimator, as already noted in Section 5.5.2. For the MLE additional possibilities arise
if the information matrix equality is assumed to hold. Then $\mathbf{A}_0^{-1}\mathbf{B}_0\mathbf{A}_0^{-1}$, $-\mathbf{A}_0^{-1}$, and $\mathbf{B}_0^{-1}$
are all asymptotically equivalent, as are the corresponding consistent estimates of these
quantities. A detailed discussion for the MLE is given in Davidson and MacKinnon
(1993, chapter 18).
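
The three forms can be computed side by side. The sketch below uses the exponential regression model of Table 5.3 with $\lambda_i = \exp(x_i\beta)$ and a scalar regressor, so that each matrix is a scalar; the data values, function names, and the Newton solver are illustrative assumptions. Each estimate is divided by $N$ to give the estimated variance of $\hat{\beta}$ itself.

```python
import math

def exp_mle(y, x, beta=0.0, tol=1e-12, max_iter=100):
    """Newton iterations for the exponential regression MLE, ln f = x*beta - exp(x*beta)*y."""
    for _ in range(max_iter):
        lam = [math.exp(xi * beta) for xi in x]
        score = sum((1 - li * yi) * xi for li, yi, xi in zip(lam, y, x))
        hess = -sum(li * yi * xi * xi for li, yi, xi in zip(lam, y, x))
        step = score / hess
        beta -= step
        if abs(step) < tol:
            break
    return beta

def variance_estimates(y, x, beta):
    """Scalar versions of -A^{-1}, B^{-1}, and A^{-1} B A^{-1}, each divided by N."""
    n = len(y)
    lam = [math.exp(xi * beta) for xi in x]
    scores = [(1 - li * yi) * xi for li, yi, xi in zip(lam, y, x)]
    hessians = [-li * yi * xi * xi for li, yi, xi in zip(lam, y, x)]
    a_hat = sum(hessians) / n                  # average second derivative (negative)
    b_hat = sum(s * s for s in scores) / n     # average squared score
    return {"hessian": -1 / a_hat / n, "opg": 1 / b_hat / n, "sandwich": b_hat / (a_hat * a_hat) / n}

y = [0.5, 1.2, 0.3, 2.5, 0.9, 4.0]
x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
beta_hat = exp_mle(y, x)
v = variance_estimates(y, x, beta_hat)
```

In a sample this small the three numbers need not be close; their equivalence is asymptotic and relies on the information matrix equality.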

The sandwich estimate $\hat{\mathbf{A}}^{-1}\hat{\mathbf{B}}\hat{\mathbf{A}}^{-1}$ is called the Huber estimate, after Huber (1967),
or White estimate, after White (1982), who considered the distribution of the MLE
without imposing the information matrix equality. The sandwich estimate is in theory
more robust than $-\hat{\mathbf{A}}^{-1}$ or $\hat{\mathbf{B}}^{-1}$. It is important to note, however, that the cause of
failure of the information matrix equality may additionally lead to the more fundamental
complication of inconsistency of $\hat{\boldsymbol{\theta}}_{ML}$. This is the subject of Section 5.7.


5.6.7. Derivation of ML Regularity Conditions


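We now formally derive the regularity conditions stated in Section 5.6.3. For notational
simplicity the subscript $i$ and the regressor vector are suppressed.

Begin by deriving the first condition (5.41). The density integrates to one, that is,

$$ \int f(y|\boldsymbol{\theta})\,dy = 1. $$

Differentiating both sides with respect to $\boldsymbol{\theta}$ yields $\frac{\partial}{\partial\boldsymbol{\theta}}\int f(y|\boldsymbol{\theta})\,dy = \mathbf{0}$. If the range of
integration (the range of $y$) does not depend on $\boldsymbol{\theta}$ this implies

$$ \int \frac{\partial f(y|\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\,dy = \mathbf{0}. \tag{5.50} $$

Now $\partial \ln f(y|\boldsymbol{\theta})/\partial\boldsymbol{\theta} = [\partial f(y|\boldsymbol{\theta})/\partial\boldsymbol{\theta}]/[f(y|\boldsymbol{\theta})]$, which implies

$$ \frac{\partial f(y|\boldsymbol{\theta})}{\partial\boldsymbol{\theta}} = \frac{\partial \ln f(y|\boldsymbol{\theta})}{\partial\boldsymbol{\theta}} f(y|\boldsymbol{\theta}). \tag{5.51} $$

Substituting (5.51) in (5.50) yields

$$ \int \frac{\partial \ln f(y|\boldsymbol{\theta})}{\partial\boldsymbol{\theta}} f(y|\boldsymbol{\theta})\,dy = \mathbf{0}, \tag{5.52} $$

which is (5.41) provided the expectation is with respect to the density $f(y|\boldsymbol{\theta})$.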
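Now consider the second condition (5.42), initially deriving a more general result.
Suppose

$$ \mathrm{E}[\mathbf{m}(y, \boldsymbol{\theta})] = \mathbf{0}, $$

for some (possibly vector) function $\mathbf{m}(\cdot)$. Then when the expectation is taken with
respect to the density $f(y|\boldsymbol{\theta})$

$$ \int \mathbf{m}(y, \boldsymbol{\theta}) f(y|\boldsymbol{\theta})\,dy = \mathbf{0}. \tag{5.53} $$

Differentiating both sides with respect to $\boldsymbol{\theta}'$ and assuming differentiation and
integration are interchangeable yields

$$ \int \left(\frac{\partial \mathbf{m}(y, \boldsymbol{\theta})}{\partial\boldsymbol{\theta}'} f(y|\boldsymbol{\theta}) + \mathbf{m}(y, \boldsymbol{\theta})\frac{\partial f(y|\boldsymbol{\theta})}{\partial\boldsymbol{\theta}'}\right) dy = \mathbf{0}. \tag{5.54} $$

Substituting (5.51) in (5.54) yields

$$ \int \left(\frac{\partial \mathbf{m}(y, \boldsymbol{\theta})}{\partial\boldsymbol{\theta}'} f(y|\boldsymbol{\theta}) + \mathbf{m}(y, \boldsymbol{\theta})\frac{\partial \ln f(y|\boldsymbol{\theta})}{\partial\boldsymbol{\theta}'} f(y|\boldsymbol{\theta})\right) dy = \mathbf{0}, \tag{5.55} $$

or

$$ \mathrm{E}\left[\frac{\partial \mathbf{m}(y, \boldsymbol{\theta})}{\partial\boldsymbol{\theta}'}\right] = -\mathrm{E}\left[\mathbf{m}(y, \boldsymbol{\theta})\frac{\partial \ln f(y|\boldsymbol{\theta})}{\partial\boldsymbol{\theta}'}\right], \tag{5.56} $$

when the expectation is taken with respect to the density $f(y|\boldsymbol{\theta})$. The regularity
condition (5.42) is the special case $\mathbf{m}(y, \boldsymbol{\theta}) = \partial \ln f(y|\boldsymbol{\theta})/\partial\boldsymbol{\theta}$ and leads to the IM equality
(5.44). The more general result (5.56) leads to the generalized IM equality (5.45).

What happens when integration and differentiation cannot be interchanged? The
starting point (5.50) no longer holds, as by the fundamental theorem of calculus
the derivative with respect to $\boldsymbol{\theta}$ of $\int f(y|\boldsymbol{\theta})\,dy$ includes an additional term reflecting
the presence of a function of $\boldsymbol{\theta}$ in the range of the integral. Then
$\mathrm{E}[\partial \ln f(y|\boldsymbol{\theta})/\partial\boldsymbol{\theta}] \neq \mathbf{0}$.

What happens when the density is misspecified? Then (5.52) still holds, but it does
not necessarily imply (5.41), since in (5.41) the expectation will no longer be with
respect to the specified density $f(y|\boldsymbol{\theta})$.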
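
As a worked illustration of the first case (an added example using the uniform density mentioned in Section 5.6.4), take $y$ uniform on $[0, \theta]$, so that $f(y|\theta) = 1/\theta$ on a range that depends on $\theta$:

```latex
\begin{align*}
\frac{\partial \ln f(y|\theta)}{\partial\theta} &= -\frac{1}{\theta}
  \quad\Longrightarrow\quad
  \mathrm{E}\left[\frac{\partial \ln f(y|\theta)}{\partial\theta}\right] = -\frac{1}{\theta} \neq 0, \\
\frac{\partial}{\partial\theta}\int_0^{\theta} f(y|\theta)\,dy
  &= f(\theta|\theta) + \int_0^{\theta} \frac{\partial f(y|\theta)}{\partial\theta}\,dy
   = \frac{1}{\theta} - \frac{1}{\theta} = 0 .
\end{align*}
```

The boundary term $f(\theta|\theta) = 1/\theta$ is the additional term that (5.50) omits, which is why the score no longer has expected value zero in this model.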


5.7. Quasi-Maximum Likelihood 


The quasi-MLE $\hat{\boldsymbol{\theta}}_{QML}$ is defined to be the estimator that maximizes a log-likelihood
function that is misspecified, as the result of specification of the wrong density.
Generally such misspecification leads to inconsistent estimation.

In this section general properties of the quasi-MLE are presented, followed by some
special cases where the quasi-MLE retains consistency.


5.7.1. Pseudo-True Value


In principle any misspecification of the density may lead to inconsistency, as then the
expectation in evaluation of $\mathrm{E}[\partial \ln f(y|\mathbf{x}, \boldsymbol{\theta})/\partial\boldsymbol{\theta}|_{\boldsymbol{\theta}_0}]$ (see Section 5.6.4) is no longer
with respect to $f(y|\mathbf{x}, \boldsymbol{\theta}_0)$.

By adaptation of the general consistency proof in Section 5.3.2, the quasi-MLE
$\hat{\boldsymbol{\theta}}_{QML}$ converges in probability to the pseudo-true value $\boldsymbol{\theta}^*$ defined as

$$ \boldsymbol{\theta}^* = \arg\max_{\boldsymbol{\theta}\in\Theta}\left(\operatorname{plim} N^{-1}\mathcal{L}_N(\boldsymbol{\theta})\right). \tag{5.57} $$

The probability limit is taken with respect to the true dgp. If the true dgp differs
from the assumed density $f(y|\mathbf{x}, \boldsymbol{\theta})$ used to form $\mathcal{L}_N(\boldsymbol{\theta})$, then usually $\boldsymbol{\theta}^* \neq \boldsymbol{\theta}_0$ and
the quasi-MLE is inconsistent.

Huber (1967) and White (1982) showed that the asymptotic distribution of the
quasi-MLE is similar to that for the MLE, except that it is centered around $\boldsymbol{\theta}^*$ and
the IM equality no longer holds. Then

$$ \sqrt{N}(\hat{\boldsymbol{\theta}}_{QML} - \boldsymbol{\theta}^*) \xrightarrow{d} \mathcal{N}\left[\mathbf{0}, \mathbf{A}^{*-1}\mathbf{B}^*\mathbf{A}^{*-1}\right], \tag{5.58} $$

where $\mathbf{A}^*$ and $\mathbf{B}^*$ are as defined in (5.18) and (5.19) except that probability limits
are taken with respect to the unknown true dgp and are evaluated at $\boldsymbol{\theta}^*$. Consistent
estimates $\hat{\mathbf{A}}^*$ and $\hat{\mathbf{B}}^*$ can be obtained as in Section 5.5.2, with evaluation at $\hat{\boldsymbol{\theta}}_{QML}$.

This distributional result is used for statistical inference if the quasi-MLE retains
consistency. If the quasi-MLE is inconsistent then usually $\boldsymbol{\theta}^*$ has no simple
interpretation, aside from that given in the next section. However, (5.58) may still be useful if
nonetheless there is interest in knowing the precision of estimation. The result (5.58)
also provides motivation for White's information matrix test (see Section 8.2.8) and
for Vuong's test for discriminating between parametric models (see Section 8.5.3).


5.7.2. Kullback-Leibler Distance


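Recall from Section 4.2.3 that if $\mathrm{E}[y|\mathbf{x}] \neq \mathbf{x}'\boldsymbol{\beta}_0$, then the OLS estimator can still be
interpreted as the best linear predictor of $\mathrm{E}[y|\mathbf{x}]$ under squared error loss. White (1982)
proposed a qualitatively similar interpretation for the quasi-MLE.

Let $f(\mathbf{y}|\boldsymbol{\theta})$ denote the assumed joint density of $y_1, \ldots, y_N$ and let $h(\mathbf{y})$ denote the
true density, which is unknown, where for simplicity dependence on regressors is
suppressed. Define the Kullback-Leibler information criterion (KLIC)

$$ \mathrm{KLIC} = \mathrm{E}\left[\ln\left(\frac{h(\mathbf{y})}{f(\mathbf{y}|\boldsymbol{\theta})}\right)\right], \tag{5.59} $$

where expectation is with respect to $h(\mathbf{y})$. KLIC takes a minimum value of 0 when
there is a $\boldsymbol{\theta}_0$ such that $h(\mathbf{y}) = f(\mathbf{y}|\boldsymbol{\theta}_0)$, that is, the density is correctly specified, and
larger values of KLIC indicate greater ignorance about the true density.

Then the quasi-MLE $\hat{\boldsymbol{\theta}}_{QML}$ minimizes the distance between $f(\mathbf{y}|\boldsymbol{\theta})$ and $h(\mathbf{y})$, where
distance is measured using KLIC. To obtain this result, note that under suitable
assumptions $\operatorname{plim} N^{-1}\mathcal{L}_N(\boldsymbol{\theta}) = \mathrm{E}[\ln f(\mathbf{y}|\boldsymbol{\theta})]$, so $\hat{\boldsymbol{\theta}}_{QML}$ converges to $\boldsymbol{\theta}^*$ that
maximizes $\mathrm{E}[\ln f(\mathbf{y}|\boldsymbol{\theta})]$. However, this is equivalent to minimizing KLIC, since
$\mathrm{KLIC} = \mathrm{E}[\ln h(\mathbf{y})] - \mathrm{E}[\ln f(\mathbf{y}|\boldsymbol{\theta})]$ and the first term does not depend on $\boldsymbol{\theta}$ as the
expectation is with respect to $h(\mathbf{y})$.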
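
A numerical sketch of the KLIC decomposition follows. It assumes, purely for illustration, that the true density $h(y)$ is gamma with shape 2 and scale 1 (mean 2) while the assumed density is exponential, $f(y|\lambda) = \lambda e^{-\lambda y}$. Expectations under $h$ are approximated by a midpoint rule and KLIC is minimized over a grid of $\lambda$ values; the grid, integration limits, and names are arbitrary choices.

```python
import math

def h_density(y):
    """Assumed 'true' density for this sketch: gamma with shape 2 and scale 1."""
    return y * math.exp(-y)

def expectation_under_h(func, upper=60.0, steps=60000):
    """Midpoint-rule approximation to E_h[func(y)] over (0, upper)."""
    width = upper / steps
    total = 0.0
    for k in range(steps):
        y = (k + 0.5) * width
        total += func(y) * h_density(y) * width
    return total

E_LN_H = expectation_under_h(lambda y: math.log(h_density(y)))   # E_h[ln h(y)]
E_Y = expectation_under_h(lambda y: y)                            # E_h[y], close to 2

def klic(lam):
    """KLIC = E_h[ln h(y)] - E_h[ln f(y|lam)], where ln f = ln(lam) - lam * y."""
    return E_LN_H - (math.log(lam) - lam * E_Y)

grid = [0.05 * k for k in range(1, 41)]   # lam = 0.05, 0.10, ..., 2.00
lam_star = min(grid, key=klic)
```

The minimizer is $\lambda^* = 1/\mathrm{E}_h[y] = 0.5$, so the misspecified exponential model still recovers the true mean, while the minimized KLIC remains strictly positive because no exponential density equals $h$.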


5.7.3. Linear Exponential Family 


In some special cases the quasi-MLE is consistent even when the density is partially
misspecified. One well-known example is that the quasi-MLE for the linear
regression model with normality is consistent even if the errors are nonnormal, provided
$\mathrm{E}[y|\mathbf{x}] = \mathbf{x}'\boldsymbol{\beta}_0$. The Poisson MLE provides a second example (see Section 5.3.4).

Similar robustness to misspecification is enjoyed by other models based on densities
in the linear exponential family (LEF). An LEF density can be expressed as

$$ f(y|\mu) = \exp\{a(\mu) + b(y) + c(\mu)y\}, \tag{5.60} $$

where we have given the mean parameterization of the LEF, so that $\mu = \mathrm{E}[y]$. It can
be shown that for this density $\mathrm{E}[y] = -[c'(\mu)]^{-1}a'(\mu)$ and $\mathrm{V}[y] = [c'(\mu)]^{-1}$, where
$c'(\mu) = \partial c(\mu)/\partial\mu$ and $a'(\mu) = \partial a(\mu)/\partial\mu$. Different functions $a(\cdot)$ and $c(\cdot)$ lead to
different densities in the family. The term $b(y)$ in (5.60) is a normalizing constant that
ensures probabilities sum or integrate to one. The remainder of the density
$\exp\{a(\mu) + c(\mu)y\}$ is an exponential function that is linear in $y$, hence explaining the
term linear exponential.

Most densities cannot be expressed in this form. Several important densities are
LEF densities, however, including those given in Table 5.4. These densities, already
presented in Table 5.3, are reexpressed in Table 5.4 in the form (5.60). Other LEF
densities are the binomial with number of trials known (the Bernoulli being a special
case), some negative binomial models (the geometric and the Poisson being special
cases), and the one-parameter gamma (the exponential being a special case).


Table 5.4. Linear Exponential Family Densities: Leading Examples


Distribution                 $f(y) = \exp\{a(\cdot) + b(y) + c(\cdot)y\}$                                                           $\mathrm{E}[y]$      $\mathrm{V}[y] = [c'(\mu)]^{-1}$
Normal ($\sigma^2$ known)    $\exp\{-\frac{\mu^2}{2\sigma^2} - \frac{1}{2}\ln(2\pi\sigma^2) - \frac{y^2}{2\sigma^2} + \frac{\mu}{\sigma^2}y\}$   $\mu$                $\sigma^2$
Bernoulli                    $\exp\{\ln(1-p) + \ln[p/(1-p)]y\}$                                                                     $\mu = p$            $\mu(1-\mu)$
Exponential                  $\exp\{\ln\lambda - \lambda y\}$                                                                       $\mu = 1/\lambda$    $\mu^2$
Poisson                      $\exp\{-\lambda - \ln y! + y\ln\lambda\}$                                                              $\mu = \lambda$      $\mu$
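
As a check on the moment formulas following (5.60), the Poisson and exponential rows of Table 5.4 can be verified directly (a worked illustration, with both densities written in terms of the mean $\mu$):

```latex
\begin{align*}
\text{Poisson: } & a(\mu) = -\mu,\quad c(\mu) = \ln\mu
  \;\Longrightarrow\; a'(\mu) = -1,\quad c'(\mu) = 1/\mu, \\
 & \mathrm{E}[y] = -[c'(\mu)]^{-1}a'(\mu) = \mu, \qquad \mathrm{V}[y] = [c'(\mu)]^{-1} = \mu . \\
\text{Exponential: } & a(\mu) = -\ln\mu,\quad c(\mu) = -1/\mu
  \;\Longrightarrow\; a'(\mu) = -1/\mu,\quad c'(\mu) = 1/\mu^2, \\
 & \mathrm{E}[y] = -\mu^2 \cdot (-1/\mu) = \mu, \qquad \mathrm{V}[y] = \mu^2 .
\end{align*}
```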


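For regression the parameter $\mu = \mathrm{E}[y|\mathbf{x}]$ is modeled as

$$ \mu = g(\mathbf{x}, \boldsymbol{\beta}), \tag{5.61} $$

for specified function $g(\cdot)$ that varies across models (see Section 5.7.4) depending
in part on restrictions on the range of $y$ and hence $\mu$. The LEF log-likelihood is
then

$$ \mathcal{L}_N(\boldsymbol{\beta}) = \sum_{i=1}^{N}\{a(g(\mathbf{x}_i, \boldsymbol{\beta})) + b(y_i) + c(g(\mathbf{x}_i, \boldsymbol{\beta}))y_i\}, \tag{5.62} $$

with first-order conditions that can be reexpressed, using the aforementioned
information on the first two moments of $y$, as

$$ \frac{\partial\mathcal{L}_N(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}} = \sum_{i=1}^{N}\frac{y_i - g(\mathbf{x}_i, \boldsymbol{\beta})}{\sigma_i^2}\times\frac{\partial g(\mathbf{x}_i, \boldsymbol{\beta})}{\partial\boldsymbol{\beta}} = \mathbf{0}, \tag{5.63} $$

where $\sigma_i^2 = [c'(g(\mathbf{x}_i, \boldsymbol{\beta}))]^{-1}$ is the assumed variance function corresponding to the
particular LEF density. For example, for Bernoulli, exponential, and Poisson, $\sigma_i^2$ equals,
respectively, $g_i(1 - g_i)$, $g_i^2$, and $g_i$, where $g_i = g(\mathbf{x}_i, \boldsymbol{\beta})$ is the conditional mean, in
agreement with the variance column of Table 5.4.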

The quasi-MLE solves these equations, but it is no longer assumed that the LEF
density is correctly specified. Gouriéroux, Monfort, and Trognon (1984a) proved that
the quasi-MLE $\hat{\boldsymbol{\beta}}_{QML}$ is consistent provided $\mathrm{E}[y|\mathbf{x}] = g(\mathbf{x}, \boldsymbol{\beta}_0)$. This is clear from
taking the expected value of the first-order conditions (5.63), which evaluated at
$\boldsymbol{\beta} = \boldsymbol{\beta}_0$ are a weighted sum of errors $y - g(\mathbf{x}, \boldsymbol{\beta}_0)$ with expected value equal to zero
if $\mathrm{E}[y|\mathbf{x}] = g(\mathbf{x}, \boldsymbol{\beta}_0)$.

Thus the quasi-MLE based on an LEF density is consistent provided only that the
conditional mean of $y$ given $\mathbf{x}$ is correctly specified. Note that the actual dgp for $y$
need not be LEF. It is the specified density, potentially incorrectly specified, that is
LEF.

Even with correct conditional mean, however, adjustment of default ML output for
variance, standard errors, and t-statistics based on $-\mathbf{A}_0^{-1}$ is warranted. In general the
sandwich form $\mathbf{A}_0^{-1}\mathbf{B}_0\mathbf{A}_0^{-1}$ should be used, unless the conditional variance of $y$ given
$\mathbf{x}$ is also correctly specified, in which case $\mathbf{A}_0 = -\mathbf{B}_0$. For Bernoulli models,
however, $\mathbf{A}_0 = -\mathbf{B}_0$ always. Consistent standard errors can be obtained using (5.36) and
(5.38).

The LEF is a very special case. In general, misspecification of any aspect of the
density leads to inconsistency of the MLE. Even in the LEF case the quasi-MLE can
be used only to predict the conditional mean whereas with a correctly specified density
one can predict the conditional distribution.


5.7.4. Generalized Linear Models 


Models based on an assumed LEF density are called generalized linear models 
(GLMs) in the statistics literature (see the book with this title by McCullagh and 
Nelder, 1989). The class of generalized linear models is the most widely used frame- 
work in applied statistics for nonlinear cross-section regression, as from Table 5.3 it 
includes nonlinear least squares, Poisson, geometric, probit, logit, binomial (known 
number of trials), gamma, and exponential regression models. We provide a short 
overview that introduces standard GLM terminology. 

Standard GLMs specify the conditional mean $g(\mathbf{x}, \boldsymbol{\beta})$ in (5.61) to be of the simpler
single-index form, so that $\mu = g(\mathbf{x}'\boldsymbol{\beta})$. Then $g^{-1}(\mu) = \mathbf{x}'\boldsymbol{\beta}$, and the function $g^{-1}(\cdot)$ is
called the link function. For example, the usual specification for the Poisson model
corresponds to the log-link function since if $\mu = \exp(\mathbf{x}'\boldsymbol{\beta})$ then $\ln\mu = \mathbf{x}'\boldsymbol{\beta}$.

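The first-order conditions (5.63) become $\sum_i (y_i - g_i)\,c'(g_i)\,g_i'\,\mathbf{x}_i = \mathbf{0}$, where
$g_i = g(\mathbf{x}_i'\boldsymbol{\beta})$ and $g_i' = g'(\mathbf{x}_i'\boldsymbol{\beta})$, since $1/\sigma_i^2 = c'(g_i)$ in (5.63). There are computational
advantages in choosing the link function so that $c'(g_i)\,g_i' = 1$ (or any constant that does
not vary with $i$), since then these first-order conditions reduce to
$\sum_i (y_i - g_i)\mathbf{x}_i = \mathbf{0}$, or the error $(y_i - g_i)$ is orthogonal to the regressors. The canonical
link function is defined to be that function $g^{-1}(\cdot)$ which leads to this simplification and
varies with $c(\mu)$ and hence the GLM. For the Poisson, for example, $c(\mu) = \ln\mu$ so
$c'(g_i) = 1/g_i$, while $g_i' = g_i$ under the log link. The canonical link function leads to
$\mu = \mathbf{x}'\boldsymbol{\beta}$ for normal, $\mu = \exp(\mathbf{x}'\boldsymbol{\beta})$ for Poisson, and
$\mu = \exp(\mathbf{x}'\boldsymbol{\beta})/[1 + \exp(\mathbf{x}'\boldsymbol{\beta})]$ for binary data.
The last of these is the logit form given earlier in Table 5.3.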

Two times the difference between the maximum achievable log-likelihood and the 
fitted log-likelihood is called the deviance, a measure that generalizes the residual sum 
of squares in linear regression to other LEF regression models. 

Models based on the LEF are very restrictive as all moments depend on just one
underlying parameter, $\mu = g(\mathbf{x}'\boldsymbol{\beta})$. The GLM literature places some additional structure
by making the convenient assumption that the LEF variance is potentially misspecified
by a scalar multiple $\alpha$, so that $\mathrm{V}[y|\mathbf{x}] = \alpha \times [c'(g(\mathbf{x}, \boldsymbol{\beta}))]^{-1}$, where $\alpha$ need not equal 1.
For example, for the Poisson model let $\mathrm{V}[y|\mathbf{x}] = \alpha g(\mathbf{x}, \boldsymbol{\beta})$ rather than $g(\mathbf{x}, \boldsymbol{\beta})$. Given
such variance misspecification it can be shown that $\mathbf{B}_0 = -\alpha\mathbf{A}_0$, so the variance matrix
of the quasi-MLE is $-\alpha\mathbf{A}_0^{-1}$, which requires only a rescaling of the nonsandwich ML
variance matrix $-\mathbf{A}_0^{-1}$ by multiplication by $\alpha$. A commonly used consistent estimate
for $\alpha$ is $\hat{\alpha} = (N - K)^{-1}\sum_i (y_i - \hat{g}_i)^2/\hat{\sigma}_i^2$, where $\hat{g}_i = g(\mathbf{x}_i, \hat{\boldsymbol{\beta}}_{QML})$, $\hat{\sigma}_i^2 = [c'(\hat{g}_i)]^{-1}$,
and division by $(N - K)$ rather than $N$ is felt to provide a better estimate in small
samples. See the preceding references and Cameron and Trivedi (1986, 1998) for
further details.
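
The following simulation sketches both the consistency of the Poisson quasi-MLE under variance misspecification and the estimate $\hat{\alpha}$. The design is an assumption made for illustration: a scalar regressor, true mean $\exp(0.5x)$, and counts generated as twice a Poisson variate with half that mean, so that $\mathrm{V}[y|x] = 2\,\mathrm{E}[y|x]$ and $\alpha = 2$. All function names, the seed, and the sample size are hypothetical.

```python
import math
import random

def draw_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the small means used here."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def poisson_qmle(y, x, beta=0.0, tol=1e-10, max_iter=100):
    """Newton iterations on sum_i (y_i - exp(x_i*beta)) * x_i = 0 (scalar regressor)."""
    for _ in range(max_iter):
        score = sum((yi - math.exp(xi * beta)) * xi for yi, xi in zip(y, x))
        hess = -sum(math.exp(xi * beta) * xi * xi for xi in x)
        step = score / hess
        beta -= step
        if abs(step) < tol:
            break
    return beta

def dispersion(y, x, beta_hat, k=1):
    """alpha_hat = (N-K)^{-1} sum_i (y_i - g_i)^2 / g_i, using the Poisson variance function."""
    fitted = [math.exp(xi * beta_hat) for xi in x]
    return sum((yi - gi) ** 2 / gi for yi, gi in zip(y, fitted)) / (len(y) - k)

rng = random.Random(12345)
N, BETA0 = 4000, 0.5
x = [rng.uniform(0.0, 2.0) for _ in range(N)]
# y = 2 * Poisson(mean/2) has mean exp(x*BETA0) and variance 2*exp(x*BETA0).
y = [2 * draw_poisson(math.exp(xi * BETA0) / 2.0, rng) for xi in x]
beta_hat = poisson_qmle(y, x)
alpha_hat = dispersion(y, x, beta_hat)
```

In this design `beta_hat` should be close to 0.5 even though the data are not Poisson, and `alpha_hat` should be close to 2, the factor by which the default ML variance would need to be rescaled.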

Many statistical packages include a GLM module that as a default gives standard
errors that are correct provided $\mathrm{V}[y|\mathbf{x}] = \alpha[c'(g(\mathbf{x}, \boldsymbol{\beta}))]^{-1}$. Alternatively, one can
estimate using ML, with standard errors obtained using the robust sandwich formula
$\mathbf{A}_0^{-1}\mathbf{B}_0\mathbf{A}_0^{-1}$. In practice the sandwich standard errors are similar to those obtained
using the simple GLM correction. Yet another way to estimate a GLM is by weighted
nonlinear least squares, as detailed at the end of Section 5.8.6.


5.7.5. Quasi-MLE for Multivariate Dependent Variables


This chapter has focused on scalar dependent variables, but the theory applies also to
the multivariate case. Suppose the dependent variable $\mathbf{y}$ is an $m \times 1$ vector, and the data
$(\mathbf{y}_i, \mathbf{X}_i)$, $i = 1, \ldots, N$, are independent over $i$. Examples given in later chapters include
seemingly unrelated equations, panel data with $m$ observations for the $i$th individual
on the same dependent variable, and clustered data where data for the $ij$th observation
are correlated over $m$ possible values of $j$.

Given specification of $f(\mathbf{y}|\mathbf{x}, \boldsymbol{\theta})$, the joint density of $\mathbf{y} = (y_1, \ldots, y_m)$ conditional on
$\mathbf{x}$, the fully efficient MLE maximizes $N^{-1}\sum_i \ln f(\mathbf{y}_i|\mathbf{x}_i, \boldsymbol{\theta})$ as noted after (5.39).
However, in multivariate applications the joint density of $\mathbf{y}$ can be complicated. A simpler
estimator is possible given knowledge only of the $m$ univariate densities $f_j(y_j|\mathbf{x}, \boldsymbol{\theta})$,
$j = 1, \ldots, m$, where $y_j$ is the $j$th component of $\mathbf{y}$. For example, for multivariate count
data one might work with $m$ independent univariate negative binomial densities for
each count rather than a richer multivariate count model that permits correlation.

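Consider then the quasi-MLE $\hat{\boldsymbol{\theta}}_{QML}$ based on the product of the univariate densities,
$\prod_j f_j(y_j|\mathbf{x}, \boldsymbol{\theta})$, that maximizes

$$ Q_N(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{m} \ln f(y_{ij}|\mathbf{x}_i, \boldsymbol{\theta}). \tag{5.64} $$

Wooldridge (2002) calls this estimator the partial MLE, since the density has been
only partially specified.

The partial MLE is an m-estimator with $q_i = \sum_j \ln f(y_{ij}|\mathbf{x}_i, \boldsymbol{\theta})$. The essential
consistency condition (5.25) requires that $\mathrm{E}[\sum_j \partial \ln f(y_{ij}|\mathbf{x}_i, \boldsymbol{\theta})/\partial\boldsymbol{\theta}|_{\boldsymbol{\theta}_0}] = \mathbf{0}$. This
condition holds if the marginal densities $f(y_{ij}|\mathbf{x}_i, \boldsymbol{\theta}_0)$ are correctly specified, since then
$\mathrm{E}[\partial \ln f(y_{ij}|\mathbf{x}_i, \boldsymbol{\theta})/\partial\boldsymbol{\theta}|_{\boldsymbol{\theta}_0}] = \mathbf{0}$ by the regularity condition (5.41).

Thus the partial MLE is consistent provided the univariate densities $f_j(y_j|\mathbf{x}, \boldsymbol{\theta})$ are
correctly specified. Consistency does not require that $f(\mathbf{y}|\mathbf{x}, \boldsymbol{\theta}) = \prod_j f_j(y_j|\mathbf{x}, \boldsymbol{\theta})$.
Dependence of $y_1, \ldots, y_m$ will lead to failure of the information matrix equality, however,
so standard errors should be computed using the sandwich form for the variance matrix
with

$$ \mathbf{A}_0 = \operatorname{plim}\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{m}\left.\frac{\partial^2 \ln f_{ij}}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}'}\right|_{\boldsymbol{\theta}_0}, \qquad
\mathbf{B}_0 = \operatorname{plim}\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{m}\sum_{k=1}^{m}\left.\frac{\partial \ln f_{ij}}{\partial\boldsymbol{\theta}}\right|_{\boldsymbol{\theta}_0}\left.\frac{\partial \ln f_{ik}}{\partial\boldsymbol{\theta}'}\right|_{\boldsymbol{\theta}_0}, \tag{5.65} $$

where $f_{ij} = f(y_{ij}|\mathbf{x}_i, \boldsymbol{\theta})$. Furthermore, the partial MLE is inefficient compared to the
MLE based on the joint density. Further discussion is given in Sections 6.9 and 6.10.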


5.8. Nonlinear Least Squares 


The NLS estimator is the natural extension of LS estimation for the linear model to the
nonlinear model with $\mathrm{E}[y|\mathbf{x}] = g(\mathbf{x}, \boldsymbol{\beta})$, where $g(\cdot)$ is nonlinear in $\boldsymbol{\beta}$. The analysis and
results are essentially the same as for linear least squares, with the single change that in
the formulas for variance matrices the regressor vector $\mathbf{x}$ is replaced by $\partial g(\mathbf{x}, \boldsymbol{\beta})/\partial\boldsymbol{\beta}|_{\hat{\boldsymbol{\beta}}}$,
the derivative of the conditional mean function evaluated at $\boldsymbol{\beta} = \hat{\boldsymbol{\beta}}$.


Table 5.5. Nonlinear Least Squares: Common Examples


Model                       Regression Function $g(\mathbf{x}, \boldsymbol{\beta})$
Exponential                 $\exp(\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3)$
Regressor raised to power   $\beta_1 x_1 + \beta_2 x_2^{\beta_3}$
Cobb-Douglas production     $\beta_1 x_1^{\beta_2} x_2^{\beta_3}$
CES production              $[\beta_1 x_1^{\beta_3} + \beta_2 x_2^{\beta_3}]^{1/\beta_3}$
Nonlinear restrictions      $\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$, where $\beta_3 = -\beta_2\beta_1$

For microeconometric analysis, controlling for heteroskedastic errors may be neces- 
sary, as in the linear case. The NLS estimator and extensions that model heteroskedas- 
tic errors are generally less efficient than the MLE, but they are widely used in microe- 
conometrics because they rely on weaker distributional assumptions. 


5.8.1. Nonlinear Regression Model 


The nonlinear regression model defines the scalar dependent variable $y$ to have
conditional mean

$$ \mathrm{E}[y_i|\mathbf{x}_i] = g(\mathbf{x}_i, \boldsymbol{\beta}), \tag{5.66} $$

where $g(\cdot)$ is a specified function, $\mathbf{x}$ is a vector of explanatory variables, and $\boldsymbol{\beta}$ is a
$K \times 1$ vector of parameters. The linear regression model of Chapter 4 is the special
case $g(\mathbf{x}, \boldsymbol{\beta}) = \mathbf{x}'\boldsymbol{\beta}$.

Common reasons for specifying a nonlinear function for E[y|x] include range re- 
striction (e.g., to ensure that E[y|x] > 0) and specification of supply or demand or 
cost or expenditure models that satisfy restrictions from producer or consumer theory. 
Some commonly used nonlinear regression models are given in Table 5.5. 


5.8.2. NLS Estimator 


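The error term is defined to be the difference between the dependent variable
and its conditional mean, $y_i - g(\mathbf{x}_i, \boldsymbol{\beta})$. The nonlinear least-squares estimator
$\hat{\boldsymbol{\beta}}_{NLS}$ minimizes the sum of squared residuals, $\sum_i (y_i - g(\mathbf{x}_i, \boldsymbol{\beta}))^2$, or equivalently
maximizes

$$ Q_N(\boldsymbol{\beta}) = -\frac{1}{2N}\sum_{i=1}^{N}(y_i - g(\mathbf{x}_i, \boldsymbol{\beta}))^2, \tag{5.67} $$

where the scale factor 1/2 simplifies the subsequent analysis.

Differentiation leads to the NLS first-order conditions

$$ \frac{\partial Q_N(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}} = \frac{1}{N}\sum_{i=1}^{N}\frac{\partial g_i}{\partial\boldsymbol{\beta}}(y_i - g_i) = \mathbf{0}, \tag{5.68} $$

where $g_i = g(\mathbf{x}_i, \boldsymbol{\beta})$. These conditions restrict the residual $(y - g)$ to be orthogonal to
$\partial g/\partial\boldsymbol{\beta}$, rather than to $\mathbf{x}$ as in the linear case. There is no explicit solution for $\hat{\boldsymbol{\beta}}_{NLS}$,
which instead is computed using iterative methods (given in Chapter 10).

The nonlinear regression model can be more compactly represented in matrix
notation. Stacking observations yields

$$ \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} g_1 \\ \vdots \\ g_N \end{bmatrix} + \begin{bmatrix} u_1 \\ \vdots \\ u_N \end{bmatrix}, \tag{5.69} $$

where $g_i = g(\mathbf{x}_i, \boldsymbol{\beta})$, or equivalently

$$ \mathbf{y} = \mathbf{g} + \mathbf{u}, \tag{5.70} $$

where $\mathbf{y}$, $\mathbf{g}$, and $\mathbf{u}$ are $N \times 1$ vectors with $i$th entries of, respectively, $y_i$, $g_i$, and $u_i$.
Then

$$ Q_N(\boldsymbol{\beta}) = -\frac{1}{2N}(\mathbf{y} - \mathbf{g})'(\mathbf{y} - \mathbf{g}) $$

and

$$ \frac{\partial Q_N(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}} = \frac{1}{N}\frac{\partial\mathbf{g}'}{\partial\boldsymbol{\beta}}(\mathbf{y} - \mathbf{g}), \tag{5.71} $$

where

$$ \frac{\partial\mathbf{g}'}{\partial\boldsymbol{\beta}} = \begin{bmatrix} \dfrac{\partial g_1}{\partial\beta_1} & \cdots & \dfrac{\partial g_N}{\partial\beta_1} \\ \vdots & & \vdots \\ \dfrac{\partial g_1}{\partial\beta_K} & \cdots & \dfrac{\partial g_N}{\partial\beta_K} \end{bmatrix} \tag{5.72} $$

is the $K \times N$ matrix of partial derivatives of $\mathbf{g}(\mathbf{X}, \boldsymbol{\beta})'$ with respect to $\boldsymbol{\beta}$.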
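
Because (5.68) has no explicit solution, an iterative scheme is needed. The sketch below uses Gauss-Newton steps for the scalar exponential mean $g(x, \beta) = \exp(\beta x)$; the choice of Gauss-Newton, the function names, and the small artificial data set are assumptions for illustration rather than the specific method of Chapter 10.

```python
import math

def nls_gauss_newton(y, x, beta=0.0, tol=1e-12, max_iter=200):
    """NLS for g(x, beta) = exp(beta * x): regress residuals on dg/dbeta at each step."""
    for _ in range(max_iter):
        g = [math.exp(beta * xi) for xi in x]
        d = [xi * gi for xi, gi in zip(x, g)]    # dg_i/dbeta
        step = sum(di * (yi - gi) for di, yi, gi in zip(d, y, g)) / sum(di * di for di in d)
        beta += step
        if abs(step) < tol:
            break
    return beta

def nls_foc(beta, y, x):
    """Left-hand side of (5.68): N^{-1} sum_i dg_i/dbeta * (y_i - g_i)."""
    return sum(xi * math.exp(beta * xi) * (yi - math.exp(beta * xi)) for yi, xi in zip(y, x)) / len(y)

x = [0.2, 0.5, 1.0, 1.5, 2.0]
noise = [0.10, -0.10, 0.05, -0.05, 0.02]   # fixed perturbations, not random draws
y = [math.exp(0.5 * xi) + ei for xi, ei in zip(x, noise)]
beta_hat = nls_gauss_newton(y, x)
```

At convergence the residuals are orthogonal to $\partial g_i/\partial\beta$ rather than to $x_i$, which is the content of (5.68).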


5.8.3. Distribution of the NLS Estimator 


The distribution of the NLS estimator will vary with the dgp. The dgp can always be
written as

$$ y_i = g(\mathbf{x}_i, \boldsymbol{\beta}_0) + u_i, \tag{5.73} $$

a nonlinear regression model with additive error $u$. The conditional mean is correctly
specified if $\mathrm{E}[y|\mathbf{x}] = g(\mathbf{x}, \boldsymbol{\beta}_0)$ in the dgp. Then the error must satisfy $\mathrm{E}[u|\mathbf{x}] = 0$.

Given the NLS first-order conditions (5.68), the essential consistency condition
(5.25) becomes

$$ \mathrm{E}\left[\left.\partial g(\mathbf{x}, \boldsymbol{\beta})/\partial\boldsymbol{\beta}\right|_{\boldsymbol{\beta}_0} \times (y - g(\mathbf{x}, \boldsymbol{\beta}_0))\right] = \mathbf{0}. $$

Equivalently, given (5.73), we need $\mathrm{E}[\partial g(\mathbf{x}, \boldsymbol{\beta})/\partial\boldsymbol{\beta}|_{\boldsymbol{\beta}_0} \times u] = \mathbf{0}$. This holds if
$\mathrm{E}[u|\mathbf{x}] = 0$, so consistency requires correct specification of the conditional mean as
in the linear case. If instead $\mathrm{E}[u|\mathbf{x}] \neq 0$ then consistent estimation requires nonlinear
instrumental methods (which are presented in Section 6.5).

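The limit distribution of $\sqrt{N}(\hat{\boldsymbol{\beta}}_{NLS} - \boldsymbol{\beta}_0)$ is obtained using an exact first-order
Taylor series expansion of the first-order conditions (5.68). This yields

$$ \sqrt{N}(\hat{\boldsymbol{\beta}}_{NLS} - \boldsymbol{\beta}_0) = -\left[\left.\frac{-1}{N}\sum_{i=1}^{N}\frac{\partial g_i}{\partial\boldsymbol{\beta}}\frac{\partial g_i}{\partial\boldsymbol{\beta}'} + \frac{1}{N}\sum_{i=1}^{N}\frac{\partial^2 g_i}{\partial\boldsymbol{\beta}\,\partial\boldsymbol{\beta}'}(y_i - g_i)\right|_{\boldsymbol{\beta}^*}\right]^{-1}\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\left.\frac{\partial g_i}{\partial\boldsymbol{\beta}}u_i\right|_{\boldsymbol{\beta}_0}, $$

for some $\boldsymbol{\beta}^*$ between $\hat{\boldsymbol{\beta}}_{NLS}$ and $\boldsymbol{\beta}_0$. For $\mathbf{A}_0$ in (5.18) simplification occurs because
the term involving $\partial^2 g/\partial\boldsymbol{\beta}\,\partial\boldsymbol{\beta}'$ drops out since $\mathrm{E}[u|\mathbf{x}] = 0$. Thus asymptotically we
need consider only

$$ \sqrt{N}(\hat{\boldsymbol{\beta}}_{NLS} - \boldsymbol{\beta}_0) = \left[\frac{1}{N}\sum_{i=1}^{N}\left.\frac{\partial g_i}{\partial\boldsymbol{\beta}}\frac{\partial g_i}{\partial\boldsymbol{\beta}'}\right|_{\boldsymbol{\beta}_0}\right]^{-1}\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\left.\frac{\partial g_i}{\partial\boldsymbol{\beta}}u_i\right|_{\boldsymbol{\beta}_0}, $$

which is exactly the same as OLS, see Section 4.4.4, except $\mathbf{x}_i$ is replaced by
$\partial g_i/\partial\boldsymbol{\beta}|_{\boldsymbol{\beta}_0}$.

This yields the following proposition, analogous to Proposition 4.1 for the OLS
estimator.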

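Proposition 5.6 (Distribution of NLS Estimator): Make the following
assumptions:
(i) The model is (5.73); that is, $y_i = g(\mathbf{x}_i, \boldsymbol{\beta}_0) + u_i$.
(ii) In the dgp $\mathrm{E}[u_i|\mathbf{x}_i] = 0$ and $\mathrm{E}[\mathbf{u}\mathbf{u}'|\mathbf{X}] = \boldsymbol{\Omega}_0$, where $\Omega_{0,ij} = \sigma_{ij}$.
(iii) The mean function $g(\cdot)$ satisfies $g(\mathbf{x}, \boldsymbol{\beta}^{(1)}) = g(\mathbf{x}, \boldsymbol{\beta}^{(2)})$ iff $\boldsymbol{\beta}^{(1)} = \boldsymbol{\beta}^{(2)}$.
(iv) The matrix

$$ \mathbf{A}_0 = \operatorname{plim}\frac{1}{N}\sum_{i=1}^{N}\left.\frac{\partial g_i}{\partial\boldsymbol{\beta}}\frac{\partial g_i}{\partial\boldsymbol{\beta}'}\right|_{\boldsymbol{\beta}_0} = \operatorname{plim}\frac{1}{N}\left.\frac{\partial\mathbf{g}'}{\partial\boldsymbol{\beta}}\frac{\partial\mathbf{g}}{\partial\boldsymbol{\beta}'}\right|_{\boldsymbol{\beta}_0} \tag{5.74} $$

exists and is finite nonsingular.
(v) $N^{-1/2}\sum_{i=1}^{N}\partial g_i/\partial\boldsymbol{\beta} \times u_i|_{\boldsymbol{\beta}_0} \xrightarrow{d} \mathcal{N}[\mathbf{0}, \mathbf{B}_0]$, where

$$ \mathbf{B}_0 = \operatorname{plim}\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N}\sigma_{ij}\left.\frac{\partial g_i}{\partial\boldsymbol{\beta}}\frac{\partial g_j}{\partial\boldsymbol{\beta}'}\right|_{\boldsymbol{\beta}_0} = \operatorname{plim}\frac{1}{N}\left.\frac{\partial\mathbf{g}'}{\partial\boldsymbol{\beta}}\boldsymbol{\Omega}_0\frac{\partial\mathbf{g}}{\partial\boldsymbol{\beta}'}\right|_{\boldsymbol{\beta}_0}. \tag{5.75} $$

Then the NLS estimator $\hat{\boldsymbol{\beta}}_{NLS}$, defined to be a root of the first-order conditions
$\partial N^{-1}Q_N(\boldsymbol{\beta})/\partial\boldsymbol{\beta} = \mathbf{0}$, is consistent for $\boldsymbol{\beta}_0$ and

$$ \sqrt{N}(\hat{\boldsymbol{\beta}}_{NLS} - \boldsymbol{\beta}_0) \xrightarrow{d} \mathcal{N}\left[\mathbf{0}, \mathbf{A}_0^{-1}\mathbf{B}_0\mathbf{A}_0^{-1}\right]. \tag{5.76} $$

Conditions (i) to (iii) imply that the regression function is correctly specified and
the regressors are uncorrelated with the errors and that $\boldsymbol{\beta}_0$ is identified. The errors can
be heteroskedastic and correlated over $i$. Conditions (iv) and (v) assume the relevant
limit results necessary for application of Theorem 5.3. For condition (v) to be satisfied
some restrictions will need to be placed on the error correlation over $i$. The probability
limits in (5.74) and (5.75) are with respect to the dgp for $\mathbf{X}$; they become regular limits
if $\mathbf{X}$ is nonstochastic.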

The matrices Ag and Bo in Proposition 5.6 are the same as the matrices Myx 
and Myx in Section 4.4.4 for the OLS estimator with x; replaced by 4g;/0(| Bo 
The asymptotic theory for NLS is the same as that for OLS, with this single 
change. 

In the special case of spherical errors, Q = ofl, so Bp = oj Ao and ViByzs] = 
oA, ', Nonlinear least squares is then asymptotically efficient among LS estimators. 
However, cross-section data errors are not necessarily heteroskedastic. 

Given Proposition 5.6, the resulting asymptotic distribution of the NLS estimator 
can be expressed as 


$$
\hat{\beta}_{NLS} \overset{a}{\sim} \mathcal{N}\left[\beta_0, (D'D)^{-1} D'\Omega_0 D (D'D)^{-1}\right], \tag{5.77}
$$

where the derivative matrix $D = \partial g/\partial\beta'|_{\beta_0}$ has ith row $\partial g_i/\partial\beta'|_{\beta_0}$ (see (5.72)), for 
notational simplicity the evaluation at $\beta_0$ is suppressed, and we assume that an LLN 
applies, so that the plim operators in the definitions of $A_0$ and $B_0$ are replaced by lim E, 
and then drop the limit. This notation is often used in later chapters. 


5.8.4. Variance Matrix Estimation for NLS 


We consider statistical inference for the usual microeconometrics situation of inde- 
pendent errors with heteroskedasticity of unknown functional form. This requires a 
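consistent estimate of $A_0^{-1} B_0 A_0^{-1}$ defined in Proposition 5.6. 

For $A_0$ defined in (5.74) it is straightforward to use the obvious estimator 

$$
\hat{A} = \frac{1}{N}\sum_{i=1}^{N} \left.\frac{\partial g_i}{\partial \beta}\right|_{\hat{\beta}} \left.\frac{\partial g_i}{\partial \beta'}\right|_{\hat{\beta}}, \tag{5.78}
$$

as $A_0$ does not involve moments of the errors. 
Given independence over i the double sum in $B_0$ defined in (5.75) simplifies to the 
single sum 

$$
B_0 = \operatorname{plim}\frac{1}{N}\sum_{i=1}^{N} \sigma_i^2 \left.\frac{\partial g_i}{\partial \beta}\frac{\partial g_i}{\partial \beta'}\right|_{\beta_0}.
$$

As for the OLS estimator (see Section 4.4.5) it is only necessary to consistently estimate 
the K × K matrix sum $B_0$. This does not require consistent estimation of $\sigma_i^2$, the 
N individual components in the sum. 

White (1980b) gave conditions under which 

$$
\hat{B} = \frac{1}{N}\sum_{i=1}^{N} \hat{u}_i^2 \left.\frac{\partial g_i}{\partial \beta}\right|_{\hat{\beta}} \left.\frac{\partial g_i}{\partial \beta'}\right|_{\hat{\beta}} = \frac{1}{N} \left.\frac{\partial g'}{\partial \beta}\right|_{\hat{\beta}} \hat{\Omega} \left.\frac{\partial g}{\partial \beta'}\right|_{\hat{\beta}} \tag{5.79}
$$

is consistent for $B_0$, where $\hat{u}_i = y_i - g(x_i, \hat{\beta})$, $\hat{\beta}$ is consistent for $\beta_0$, and 

$$
\hat{\Omega} = \operatorname{Diag}[\hat{u}_i^2]. \tag{5.80}
$$

This leads to the following heteroskedastic-consistent estimate of the asymptotic 
variance matrix of the NLS estimator: 

$$
\hat{V}[\hat{\beta}_{NLS}] = (\hat{D}'\hat{D})^{-1} \hat{D}'\hat{\Omega}\hat{D} (\hat{D}'\hat{D})^{-1}, \tag{5.81}
$$

where $\hat{D} = \partial g/\partial\beta'|_{\hat{\beta}}$. This equation is the same as the OLS result in Section 4.4.5, 
with the regressor matrix X replaced by $\hat{D}$. In practice, a degrees of freedom correction 
may be used, so that $\hat{B}$ in (5.79) is computed using division by (N − K) rather than by 
N. Then the right-hand side in (5.81) should be multiplied by N/(N − K). 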
Generalization to errors correlated over i is given in Section 5.8.7. 
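
To make (5.78) through (5.81) concrete, the following Python sketch computes the heteroskedastic-robust variance estimate from a derivative matrix and a residual vector. It assumes NumPy is available; the function name `nls_robust_variance`, its arguments, and the demonstration data are hypothetical and serve only as an illustration.

```python
import numpy as np

def nls_robust_variance(D_hat, u_hat, dof_correction=False):
    """Sandwich estimate (D'D)^{-1} D' Omega D (D'D)^{-1} as in (5.81),
    with Omega = Diag[u_hat_i^2] as in (5.80)."""
    D_hat = np.asarray(D_hat, dtype=float)   # N x K, ith row is dg_i/dbeta' at beta_hat
    u_hat = np.asarray(u_hat, dtype=float)   # N residuals y_i - g(x_i, beta_hat)
    N, K = D_hat.shape
    bread = np.linalg.inv(D_hat.T @ D_hat)
    meat = D_hat.T @ (D_hat * (u_hat ** 2)[:, None])
    V = bread @ meat @ bread
    if dof_correction:
        V = V * N / (N - K)                  # optional N/(N-K) scaling
    return V

# Made-up check: if every squared residual equals one, then Omega = I and
# the sandwich collapses to (D'D)^{-1}.
rng = np.random.default_rng(0)
D_demo = rng.normal(size=(50, 2))
V_demo = nls_robust_variance(D_demo, np.ones(50))
V_demo_dof = nls_robust_variance(D_demo, np.ones(50), dof_correction=True)
```

Standard errors are the square roots of the diagonal of the returned matrix. For linear regression the same function applies with `D_hat` set to the regressor matrix X.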


5.8.5. Exponential Regression Example 


As an example, suppose that y given x has exponential conditional mean, so that 
$E[y|x] = \exp(x'\beta)$. The model can be expressed as a nonlinear regression with 

$$
y = \exp(x'\beta) + u,
$$

where the error term u has E[u|x] = 0 and the error is potentially heteroskedastic. 
The NLS estimator has first-order conditions 

$$
N^{-1}\sum_{i=1}^{N} \left(y_i - \exp(x_i'\beta)\right)\exp(x_i'\beta)\,x_i = 0, \tag{5.82}
$$

so consistency of $\hat{\beta}_{NLS}$ requires only that the conditional mean be correctly specified 
with $E[y|x] = \exp(x'\beta_0)$. Here $\partial g/\partial\beta = \exp(x'\beta)x$, so the general NLS result (5.81) 
yields the heteroskedastic-robust estimate 

$$
\hat{V}[\hat{\beta}_{NLS}] = \left(\sum_{i} e^{2x_i'\hat{\beta}} x_i x_i'\right)^{-1} \sum_{i} \hat{u}_i^2 e^{2x_i'\hat{\beta}} x_i x_i' \left(\sum_{i} e^{2x_i'\hat{\beta}} x_i x_i'\right)^{-1}, \tag{5.83}
$$

where $\hat{u}_i = y_i - \exp(x_i'\hat{\beta}_{NLS})$. 
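
The first-order conditions (5.82) have no closed-form solution, so NLS is computed iteratively. The sketch below is one possible implementation, not the method used in the text: it uses Gauss-Newton iterations with step halving (an assumed choice of algorithm) and then evaluates (5.83). The function name `exp_nls` and the noiseless demonstration data are hypothetical.

```python
import numpy as np

def exp_nls(y, X, beta_start, max_iter=200, tol=1e-10):
    """NLS for E[y|x] = exp(x'beta) by Gauss-Newton with step halving.
    Returns (beta_hat, V_robust), where V_robust follows (5.83)."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    beta = np.asarray(beta_start, dtype=float).copy()

    def ssr(b):
        r = y - np.exp(X @ b)
        return r @ r

    for _ in range(max_iter):
        g = np.exp(X @ beta)
        D = g[:, None] * X                         # rows exp(x_i'beta) x_i'
        step = np.linalg.lstsq(D, y - g, rcond=None)[0]
        lam = 1.0
        while ssr(beta + lam * step) > ssr(beta) and lam > 1e-8:
            lam /= 2.0                             # shrink until the SSR does not rise
        beta = beta + lam * step
        if np.max(np.abs(lam * step)) < tol:
            break

    g = np.exp(X @ beta)
    D = g[:, None] * X
    u = y - g
    bread = np.linalg.inv(D.T @ D)                 # (sum e^{2x'b} x x')^{-1}
    meat = D.T @ (D * (u ** 2)[:, None])           # sum u_i^2 e^{2x'b} x x'
    return beta, bread @ meat @ bread

# Made-up noiseless data: NLS should then recover beta_true.
rng = np.random.default_rng(1)
X_demo = np.column_stack([np.ones(200), rng.uniform(0.0, 1.0, 200)])
beta_true = np.array([0.5, -0.5])
y_demo = np.exp(X_demo @ beta_true)
beta_hat, V_hat = exp_nls(y_demo, X_demo, beta_start=np.zeros(2))
```

With real data the residuals are nonzero and the square roots of the diagonal of `V_hat` give the robust standard errors. Starting values matter for nonlinear problems; zeros are used here only because the demonstration data are well behaved.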


5.8.6. Weighted NLS and FGNLS 


For cross-section data the errors are often heteroskedastic. Then feasible generalized 
NLS that controls for the heteroskedasticity is more efficient than NLS. 

Feasible generalized nonlinear least squares (FGNLS) is still generally less efficient 
than ML. The notable exception is that FGNLS is asymptotically equivalent to the 
MLE when the conditional density for y is an LEF density. A special case is that FGLS 
is asymptotically equivalent to the MLE in the linear regression under normality. 
Table 5.6. Nonlinear Least-Squares Estimators and Their Asymptotic Variance^a 

Estimator   Objective Function                                           Estimated Asymptotic Variance 
NLS         $Q_N(\beta) = -\frac{1}{2N} u'u$                              $(\hat{D}'\hat{D})^{-1} \hat{D}'\hat{\Omega}\hat{D} (\hat{D}'\hat{D})^{-1}$ 
FGNLS       $Q_N(\beta) = -\frac{1}{2N} u'\Omega(\hat{\gamma})^{-1}u$     $(\hat{D}'\hat{\Omega}^{-1}\hat{D})^{-1}$ 
WNLS        $Q_N(\beta) = -\frac{1}{2N} u'\hat{\Sigma}^{-1}u$             $(\hat{D}'\hat{\Sigma}^{-1}\hat{D})^{-1} \hat{D}'\hat{\Sigma}^{-1}\hat{\Omega}\hat{\Sigma}^{-1}\hat{D} (\hat{D}'\hat{\Sigma}^{-1}\hat{D})^{-1}$ 

^a Functions are for a nonlinear regression model with error u = y − g defined in (5.70) and error conditional 
variance matrix $\Omega$. $\hat{D}$ is the derivative of the conditional mean vector with respect to $\beta'$ evaluated at $\hat{\beta}$. For FGNLS 
it is assumed that $\hat{\Omega}$ is consistent for $\Omega$. For NLS and WNLS the heteroskedastic robust variance matrix uses $\hat{\Omega}$ 
equal to a diagonal matrix with squared residuals on the diagonals, an estimate that need not be consistent for $\Omega$. 


If heteroskedasticity is incorrectly modeled then the FGNLS estimator retains con- 
sistency but one should then obtain standard errors that are robust to misspecification 
of the model for heteroskedasticity. The analysis is very similar to that for the linear 
model given in Section 4.5. 


Feasible Generalized Nonlinear Least Squares 


The feasible generalized nonlinear least-squares estimator $\hat{\beta}_{FGNLS}$ maximizes 

$$
Q_N(\beta) = -\frac{1}{2N}(y - g)'\,\Omega(\hat{\gamma})^{-1}(y - g), \tag{5.84}
$$

where it is assumed that $E[uu'|X] = \Omega(\gamma_0)$ and $\hat{\gamma}$ is a consistent estimate of $\gamma_0$. 

If the assumptions made for the NLS estimator are satisfied and in fact $\Omega_0 = \Omega(\gamma_0)$, 
then the FGNLS estimator is consistent and asymptotically normal with estimated 
asymptotic variance matrix given in Table 5.6. The variance matrix estimate is similar 
to that for linear FGLS, $[X'\Omega(\hat{\gamma})^{-1}X]^{-1}$, except that X is replaced by $\hat{D} = \partial g/\partial\beta'|_{\hat{\beta}_{FGNLS}}$. 

The FGNLS estimator is the most efficient consistent estimator that minimizes 
quadratic loss functions of the form $(y - g)'V(y - g)$, where V is a weighting matrix. 

In general, implementation of FGNLS requires inversion of the N × N matrix 
$\Omega(\hat{\gamma})$. This may be computationally impossible for large N, but in practice $\Omega(\hat{\gamma})$ usually 
has a structure, such as diagonality, that leads to an analytical solution for the 
inverse. 


Weighted NLS 


The FGNLS approach is fully efficient but leads to invalid standard error estimates if 
the model for $\Omega_0$ is misspecified. Here we consider an approach between NLS and 
FGNLS that specifies a model for the variance matrix of the errors but then obtains 
robust standard errors. The discussion mirrors that in Section 4.5.2. 

The weighted nonlinear least squares (WNLS) estimator $\hat{\beta}_{WNLS}$ maximizes 

$$
Q_N(\beta) = -\frac{1}{2N}(y - g)'\,\hat{\Sigma}^{-1}(y - g), \tag{5.85}
$$

where $\Sigma = \Sigma(\gamma)$ is a working error variance matrix, $\hat{\Sigma} = \Sigma(\hat{\gamma})$, where $\hat{\gamma}$ is an 
estimate of $\gamma$, and, in a departure from FGNLS, $\Sigma \neq \Omega_0$. 

Under assumptions similar to those for the NLS estimator and assuming that $\Sigma_0 = \operatorname{plim}\hat{\Sigma}$, 
the WNLS estimator is consistent and asymptotically normal with estimated 
asymptotic variance matrix given in Table 5.6. 

This estimator is called WNLS to distinguish it from FGNLS, which assumed that 
$\Sigma = \Omega_0$. The WNLS estimator hopefully lies between NLS and FGNLS in terms of 
efficiency, though it may be less efficient than NLS if a poor model of the error variance 
matrix is chosen. The NLS and OLS estimators are special cases of WNLS with 
$\Sigma = \sigma^2 I$. 


Heteroskedastic Errors 


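An obvious working model for heteroskedasticity is $\sigma_i^2 = E[u_i^2|x_i] = \exp(z_i'\gamma_0)$, 
where the vector z is a specified function of x (such as selected subcomponents of 
x) and using the exponential ensures a positive variance. 

Then $\Sigma = \operatorname{Diag}[\exp(z_i'\gamma)]$ and $\hat{\Sigma} = \operatorname{Diag}[\exp(z_i'\hat{\gamma})]$, where $\hat{\gamma}$ can be obtained by 
nonlinear regression of squared NLS residuals $(y_i - g(x_i, \hat{\beta}_{NLS}))^2$ on $\exp(z_i'\gamma)$. Since 
$\hat{\Sigma}$ is diagonal, $\hat{\Sigma}^{-1} = \operatorname{Diag}[1/\hat{\sigma}_i^2]$. Then (5.85) simplifies and the WNLS estimator 
maximizes 

$$
Q_N(\beta) = -\frac{1}{2N}\sum_{i=1}^{N} \frac{(y_i - g(x_i, \beta))^2}{\hat{\sigma}_i^2}. \tag{5.86}
$$

The variance matrix of the WNLS estimator given in Table 5.6 yields 

$$
\hat{V}[\hat{\beta}_{WNLS}] = \left(\sum_{i=1}^{N} \frac{1}{\hat{\sigma}_i^2} d_i d_i'\right)^{-1} \left(\sum_{i=1}^{N} \frac{\hat{u}_i^2}{\hat{\sigma}_i^4} d_i d_i'\right) \left(\sum_{i=1}^{N} \frac{1}{\hat{\sigma}_i^2} d_i d_i'\right)^{-1}, \tag{5.87}
$$

where $d_i = \partial g(x_i, \beta)/\partial\beta|_{\hat{\beta}_{WNLS}}$ and $\hat{u}_i = y_i - g(x_i, \hat{\beta}_{WNLS})$ is the residual. In practice 
a degrees of freedom correction may be used, so that the right-hand side of (5.87) 
is multiplied by N/(N − K). If the stronger assumption is made that $\Sigma = \Omega_0$, then 
WNLS becomes FGNLS and 

$$
\hat{V}[\hat{\beta}_{FGNLS}] = \left(\sum_{i=1}^{N} \frac{1}{\hat{\sigma}_i^2} d_i d_i'\right)^{-1}. \tag{5.88}
$$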


The WNLS and FGNLS estimators can be implemented using an NLS program. 
First, do NLS regression of $y_i$ on $g(x_i, \beta)$. Second, obtain $\hat{\gamma}$ by, for example, NLS 
regression of $(y_i - g(x_i, \hat{\beta}_{NLS}))^2$ on $\exp(z_i'\gamma)$ if $\sigma_i^2 = \exp(z_i'\gamma)$. Third, perform an NLS 
regression of $y_i/\hat{\sigma}_i$ on $g(x_i, \beta)/\hat{\sigma}_i$, where $\hat{\sigma}_i^2 = \exp(z_i'\hat{\gamma})$. This is equivalent to 
maximizing (5.86). White robust sandwich standard errors from this transformed regression 
give robust WNLS standard errors based on (5.87). The usual nonrobust standard 
errors from this transformed regression give FGNLS standard errors based on 
(5.88). 
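
The three steps just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the conditional mean is taken to be exp(x'β), the working variance is exp(z'γ), both X and Z have a constant in their first column, and a simple Gauss-Newton routine with step halving stands in for "an NLS program". The names `gauss_newton` and `wnls_three_step` and the simulated data are hypothetical.

```python
import numpy as np

def gauss_newton(resid_fn, jac_fn, start, max_iter=200, tol=1e-9):
    """Minimize sum(resid_fn(b)**2); jac_fn(b) is the derivative of the fitted part."""
    b = np.asarray(start, dtype=float).copy()
    ssr = lambda v: float(np.sum(resid_fn(v) ** 2))
    for _ in range(max_iter):
        step = np.linalg.lstsq(jac_fn(b), resid_fn(b), rcond=None)[0]
        lam = 1.0
        while ssr(b + lam * step) > ssr(b) and lam > 1e-8:
            lam /= 2.0                              # step halving
        b = b + lam * step
        if np.max(np.abs(lam * step)) < tol:
            break
    return b

def wnls_three_step(y, X, Z):
    # Step 1: NLS of y on exp(x'beta).
    b0 = np.zeros(X.shape[1]); b0[0] = np.log(y.mean())
    beta_nls = gauss_newton(lambda b: y - np.exp(X @ b),
                            lambda b: np.exp(X @ b)[:, None] * X, b0)
    # Step 2: NLS of squared NLS residuals on exp(z'gamma).
    u2 = (y - np.exp(X @ beta_nls)) ** 2
    c0 = np.zeros(Z.shape[1]); c0[0] = np.log(u2.mean())
    gamma = gauss_newton(lambda c: u2 - np.exp(Z @ c),
                         lambda c: np.exp(Z @ c)[:, None] * Z, c0)
    sig = np.sqrt(np.exp(Z @ gamma))                # sigma_hat_i
    # Step 3: NLS of y/sigma_hat on exp(x'beta)/sigma_hat.
    beta_w = gauss_newton(lambda b: (y - np.exp(X @ b)) / sig,
                          lambda b: (np.exp(X @ b) / sig)[:, None] * X, beta_nls)
    d_w = (np.exp(X @ beta_w) / sig)[:, None] * X   # rows d_i'/sigma_hat_i
    u_w = (y - np.exp(X @ beta_w)) / sig            # u_hat_i/sigma_hat_i
    V_fgnls = np.linalg.inv(d_w.T @ d_w)            # (5.88)
    V_robust = V_fgnls @ (d_w.T @ (d_w * (u_w ** 2)[:, None])) @ V_fgnls   # (5.87)
    return beta_w, V_robust, V_fgnls

# Simulated heteroskedastic data (all values made up for illustration).
rng = np.random.default_rng(2)
N = 2000
x = rng.uniform(0.0, 2.0, N)
X_demo = np.column_stack([np.ones(N), x])
sd = np.sqrt(np.exp(-1.0 + 1.0 * x))
y_demo = np.exp(X_demo @ np.array([1.0, 0.5])) + sd * rng.normal(size=N)
beta_w_demo, V_rob_demo, V_fg_demo = wnls_three_step(y_demo, X_demo, X_demo)
```

Comparing the square roots of the diagonals of `V_rob_demo` and `V_fg_demo` gives an informal check on the working variance model: the two should be close when that model is adequate.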

With heteroskedastic errors it is very tempting to go one step further and attempt 
FGNLS using $\hat{\Omega} = \operatorname{Diag}[\hat{u}_i^2]$. This will give inconsistent parameter estimates of $\beta_0$, 
however, as FGNLS regression of $y_i$ on $g(x_i, \beta)$ then reduces to NLS regression 
of $y_i/|\hat{u}_i|$ on $g(x_i, \beta)/|\hat{u}_i|$. The technique suffers from the fundamental problem of 
correlation between regressors and error term. Alternative semiparametric methods 
that enable an estimator as efficient as feasible GLS, without specifying a functional 
form for $\Omega_0$, are presented in Section 9.7.6. 


Generalized Linear Models 


Implementation of the weighted NLS approach requires a reasonable specification for 
the working matrix. A somewhat ad hoc approach, already presented, is to let $\sigma_i^2 = \exp(z_i'\gamma)$, 
where z is often a subset of x. For example, in regression of earnings on 
schooling and other control variables we might model heteroskedasticity more simply 
as being a function of just a few of the regressors, most notably schooling. 

Some types of cross-section data provide a natural model for heteroskedasticity 
that is very parsimonious. For example, for count data the Poisson density specifies 
that the variance equals the mean, so $\sigma_i^2 = g(x_i, \beta)$. This provides a working model 
for heteroskedasticity that introduces no further parameters than those already used in 
modeling the conditional mean. 

This approach of letting the working model for the variance be a function of the 
mean arises naturally for generalized linear models, introduced in Sections 5.7.3 and 
5.7.4. From (5.63) the first-order conditions for the quasi-MLE based on an LEF density 
are of the form 

$$
\sum_{i=1}^{N} \frac{y_i - g(x_i, \beta)}{\sigma_i^2}\,\frac{\partial g(x_i, \beta)}{\partial \beta} = 0,
$$

where $\sigma_i^2 = [c'(g(x_i, \beta))]^{-1}$ is the assumed variance function corresponding to the 
particular GLM (see (5.60)). For example, for Poisson, Bernoulli, and exponential 
distributions $\sigma_i^2$ equals, respectively, $g_i$, $g_i(1 - g_i)$, and $g_i^2$, where $g_i = g(x_i, \beta)$. 
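
As a worked illustration of these first-order conditions, consider the Poisson case under the additional assumption (made here only for the example) that the conditional mean is exponential, $g_i = \exp(x_i'\beta)$. The variance function then cancels against the derivative of the mean:

```latex
% Assumption for this example: Poisson variance function with g_i = \exp(x_i'\beta)
\[
\sigma_i^2 = g_i = \exp(x_i'\beta), \qquad
\frac{\partial g_i}{\partial \beta} = \exp(x_i'\beta)\, x_i ,
\]
\[
\sum_{i=1}^{N} \frac{y_i - \exp(x_i'\beta)}{\exp(x_i'\beta)}\, \exp(x_i'\beta)\, x_i
\;=\; \sum_{i=1}^{N} \bigl( y_i - \exp(x_i'\beta) \bigr)\, x_i \;=\; 0 .
\]
```

So in this case the weighted first-order conditions require only that the residuals be orthogonal to the regressors, in contrast to the unweighted NLS conditions (5.82), which carry the extra factor $\exp(x_i'\beta)$.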

These first-order conditions can be solved for $\beta$ in one step that allows for dependence 
of $\sigma_i^2$ on $\beta$. In a simpler two-step method one computes $\hat{\sigma}_i^2 = [c'(g(x_i, \hat{\beta}))]^{-1}$ given 
an initial NLS estimate of $\beta$ and then does a weighted NLS regression of $y_i/\hat{\sigma}_i$ on 
$g(x_i, \beta)/\hat{\sigma}_i$. The resulting estimator of $\beta$ is asymptotically equivalent to the quasi-MLE 
that directly solves (5.63) (see Gouriéroux, Monfort, and Trognon, 1984a, or 
Cameron and Trivedi, 1986). Thus FGNLS is asymptotically equivalent to ML estimation 
when the density is an LEF density. To guard against misspecification of $\sigma_i^2$, inference 
is based on robust sandwich standard errors, or one lets $\sigma_i^2 = \alpha[c'(g(x_i, \beta))]^{-1}$, 
where the estimate $\hat{\alpha}$ is given in Section 5.7.4. 


5.8.7. Time Series 


The general NLS result in Proposition 5.6 applies to all types of data, including time- 
series data. The subsequent results on variance matrix estimation focused on the cross- 
section case of heteroskedastic errors, but they are easily adapted to the case of time- 
series data with serially correlated errors. Indeed, results on robust variance matrix 
estimation using spectral methods for the time-series case preceded those for the cross- 
section case. 
The time-series nonlinear regression model is 

$$
y_t = g(x_t, \beta) + u_t, \quad t = 1, \ldots, T.
$$

If the error $u_t$ is serially correlated it is common to use the autoregressive moving 
average or ARMA(p, q) model 

$$
u_t = \rho_1 u_{t-1} + \cdots + \rho_p u_{t-p} + \varepsilon_t + \alpha_1 \varepsilon_{t-1} + \cdots + \alpha_q \varepsilon_{t-q},
$$

where $\varepsilon_t$ is iid with mean 0 and variance $\sigma^2$, and restrictions may be placed on ARMA 
model parameters to ensure stationarity and invertibility. The ARMA error model implies 
a particular structure to the error variance matrix $\Omega_0 = \Omega(\rho, \alpha)$. 

The ARMA model provides a good model for $\Omega_0$ in the time-series case. In contrast, 
in the cross-section case, it is more difficult to correctly model heteroskedasticity, 
leading to greater emphasis on robust inference that does not require specification of a 
model for $\Omega_0$. 

What if errors are both heteroskedastic and serially correlated? The NLS estimator 
is consistent though inefficient if errors are serially correlated, provided $x_t$ does not 
include lagged dependent variables, in which case it becomes inconsistent. White and 
Domowitz (1984) generalized (5.79) to obtain a robust estimate of the variance matrix 
of the NLS estimator given heteroskedasticity and serial correlation of unknown functional 
form, assuming serial correlation of no more than, say, l lags. In practice a minor 
refinement due to Newey and West (1987b) is used. This refinement is a rescaling that 
ensures that the variance matrix estimate is positive semidefinite. Several other refinements 
have also been proposed and the assumption of fixed lag length has been relaxed 
so that it is possible for l → ∞ at a sufficiently slower rate than N → ∞. This permits 
an AR component for the error. 
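
A sketch of such a serial-correlation-robust estimate of $B_0$ follows. It assumes the Bartlett-kernel weights 1 − j/(l + 1) that are commonly associated with the Newey and West refinement; the function name `newey_west_B`, the lag length, and the simulated AR(1) errors are hypothetical choices for illustration only.

```python
import numpy as np

def newey_west_B(D_hat, u_hat, lags):
    """Estimate of B_0 allowing heteroskedasticity and serial correlation
    up to `lags` lags, using Bartlett weights (assumed here)."""
    D_hat = np.asarray(D_hat, dtype=float)
    u_hat = np.asarray(u_hat, dtype=float)
    T = D_hat.shape[0]
    s = D_hat * u_hat[:, None]                 # rows s_t' = u_t * dg_t/dbeta'
    B = s.T @ s / T                            # lag 0 term, equal to (5.79)
    for j in range(1, lags + 1):
        w = 1.0 - j / (lags + 1.0)             # Bartlett weight, declines with the lag
        G = s[j:].T @ s[:-j] / T               # (1/T) sum_t s_t s_{t-j}'
        B += w * (G + G.T)
    return B

# Made-up example: a linear mean (so D is the regressor matrix) with AR(1) errors.
rng = np.random.default_rng(3)
T = 300
D_demo = np.column_stack([np.ones(T), rng.normal(size=T)])
e = rng.normal(size=T)
u_demo = np.empty(T)
u_demo[0] = e[0]
for t in range(1, T):
    u_demo[t] = 0.6 * u_demo[t - 1] + e[t]

B_white = newey_west_B(D_demo, u_demo, lags=0)
B_hac = newey_west_B(D_demo, u_demo, lags=4)
bread = np.linalg.inv(D_demo.T @ D_demo)
V_hac = bread @ (T * B_hac) @ bread            # sandwich analogue of (5.81)
```

With `lags=0` the function reduces to the heteroskedasticity-robust estimate (5.79). The declining weights are what keep the estimate positive semidefinite.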


5.9. Example: ML and NLS Estimation 


Maximum likelihood and NLS estimation, standard error calculation, and coefficient 
interpretation are illustrated using simulation data. 


5.9.1. Model and Estimators 


The exponential distribution is used for continuous positive data, notably duration data 
studied in Chapter 17. The exponential density is 


$$
f(y) = \lambda e^{-\lambda y}, \quad y > 0, \quad \lambda > 0,
$$

with mean $1/\lambda$ and variance $1/\lambda^2$. We introduce regressors into this model by setting 

$$
\lambda = \exp(x'\beta),
$$

which ensures $\lambda > 0$. Note that this implies that 

$$
E[y|x] = \exp(-x'\beta).
$$

An alternative parameterization instead specifies $E[y|x] = \exp(x'\beta)$, so that $\lambda = \exp(-x'\beta)$. 
Note that the exponential is used in two different ways: for the density 
and for the conditional mean. 

The OLS estimator from regression of y on x is inconsistent, since it fits a straight 
line when the regression function is in fact an exponential curve. 

The MLE is easily obtained. The log-density is $\ln f(y|x) = x'\beta - y\exp(x'\beta)$, leading 
to ML first-order conditions $N^{-1}\sum_i (1 - y_i \exp(x_i'\beta))x_i = 0$, or 

$$
N^{-1}\sum_{i} \frac{y_i - \exp(-x_i'\beta)}{\exp(-x_i'\beta)}\, x_i = 0.
$$

To perform NLS regression, note that the model can also be expressed as a nonlinear 
regression with 

$$
y = \exp(-x'\beta) + u,
$$


where the error term u has E[u|x] = 0, though it is heteroskedastic. The first-order 
conditions for an exponential conditional mean for this model, aside from a sign rever- 
sal, have already been given in (5.82) and clearly lead to an estimator that differs from 
the MLE. 

As an example of weighted NLS we suppose that the error variance is proportional 
to the mean. Then the working variance is V[y] = E[y] and weighted least 
squares can be implemented by NLS regression of $y_i/\hat{\sigma}_i$ on $\exp(-x_i'\beta)/\hat{\sigma}_i$, where 
$\hat{\sigma}_i^2 = \exp(-x_i'\hat{\beta}_{NLS})$. This estimator is less efficient than the MLE and may or may not 
be more efficient than NLS. 

Feasible generalized NLS can be implemented here, since we know the dgp. 
Because $V[y] = 1/\lambda^2$ for the exponential density, the variance equals the mean 
squared, and it follows that $V[u|x] = [\exp(-x'\beta)]^2$. The FGNLS estimator estimates $\sigma_i^2$ 
by $\hat{\sigma}_i^2 = [\exp(-x_i'\hat{\beta}_{NLS})]^2$ and can be implemented by NLS regression of $y_i/\hat{\sigma}_i$ on 
$\exp(-x_i'\beta)/\hat{\sigma}_i$. In general FGNLS is less efficient than the MLE. In this example it is 
actually fully efficient as the exponential density is an LEF density (see the discussion 
at the end of Section 5.8.6). 
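
The efficiency of FGNLS in this example can be seen heuristically from the first-order conditions. The following derivation is a sketch that evaluates the weights at the same $\beta$ as the mean; in the two-step estimator the weights are instead evaluated at a first-step estimate, so the equivalence holds only asymptotically.

```latex
% Weighted NLS first-order conditions with g_i = \exp(-x_i'\beta) and \sigma_i^2 = g_i^2
\[
\sum_{i=1}^{N} \frac{y_i - g_i}{\sigma_i^2}\,\frac{\partial g_i}{\partial \beta} = 0,
\qquad g_i = \exp(-x_i'\beta), \quad
\frac{\partial g_i}{\partial \beta} = -g_i\, x_i, \quad \sigma_i^2 = g_i^2 ,
\]
\[
\sum_{i=1}^{N} \frac{y_i - g_i}{g_i^2}\,(-g_i\, x_i)
\;=\; -\sum_{i=1}^{N} \frac{y_i - \exp(-x_i'\beta)}{\exp(-x_i'\beta)}\, x_i \;=\; 0 ,
\]
```

which, apart from the sign and the scaling by $N^{-1}$, are the ML first-order conditions given earlier in this section.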


5.9.2. Simulation and Results 


For simplicity we consider regression on an intercept and a regressor. The data- 
generating process is 


$$
y|x \sim \text{exponential}[\lambda], \qquad \lambda = \exp(\beta_1 + \beta_2 x),
$$

where $x \sim \mathcal{N}[1, 1^2]$ and $(\beta_1, \beta_2) = (2, -1)$. A large sample of size 10,000 was drawn 
to minimize differences in estimates, particularly standard errors, arising from sam- 
pling variability. For the particular sample of 10,000 drawn here the sample mean of 
y is 0.62 and the sample standard deviation of y is 1.29. 
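
A simulation along these lines, followed by ML estimation, might look as follows in Python. This is a sketch only: the seed is arbitrary, so the draws differ from the sample behind the results reported here, and Newton's method with step halving is an assumed choice of optimizer. As a check on the design, E[y] = E[exp(−2 + x)] = exp(−2 + 1 + 1/2) ≈ 0.61 when x is N[1, 1], which is in line with the reported sample mean of 0.62.

```python
import numpy as np

rng = np.random.default_rng(12345)            # arbitrary seed, not the book's sample
N = 10_000
x = rng.normal(loc=1.0, scale=1.0, size=N)
X = np.column_stack([np.ones(N), x])
beta_true = np.array([2.0, -1.0])
lam = np.exp(X @ beta_true)
y = rng.exponential(scale=1.0 / lam)          # rate lam, so E[y|x] = 1/lam

def loglik(b):
    # Sum over i of ln f(y_i|x_i) = x_i'b - y_i exp(x_i'b)
    return np.sum(X @ b - y * np.exp(X @ b))

def exp_mle(start, max_iter=100, tol=1e-10):
    b = np.asarray(start, dtype=float).copy()
    for _ in range(max_iter):
        w = y * np.exp(X @ b)
        score = X.T @ (1.0 - w)               # ML first-order conditions (unscaled)
        hess = -(X * w[:, None]).T @ X        # negative definite Hessian
        step = np.linalg.solve(hess, -score)  # Newton step
        t = 1.0
        while loglik(b + t * step) < loglik(b) and t > 1e-8:
            t /= 2.0                          # step halving
        b = b + t * step
        if np.max(np.abs(t * step)) < tol:
            break
    return b

beta_ml = exp_mle(np.zeros(2))
A_hat = (X * (y * np.exp(X @ beta_ml))[:, None]).T @ X / N   # minus the average Hessian
se_ml = np.sqrt(np.diag(np.linalg.inv(A_hat)) / N)           # nonrobust SEs (IM equality)
```

The nonrobust standard errors computed in the last line rely on the information matrix equality, which holds here because the estimated density is the dgp.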

Table 5.7 presents OLS, ML, NLS, WNLS, and FGNLS estimates. Up to three 
different standard error estimates are also given. The default regression output yields 
nonrobust standard errors, given in parentheses. For OLS and NLS estimators these 
assume iid errors, an erroneous assumption here, and for the MLE these impose the 
IM equality, a valid assumption here since the assumed density is the dgp. The robust 
standard errors, given in square brackets, use the robust sandwich variance estimate 
$N^{-1}\hat{A}^{-1}\hat{B}_{OP}\hat{A}^{-1}$, where $\hat{B}_{OP}$ is the outer product estimate given in (5.38). These 
estimates are heteroskedastic consistent. For standard errors of the NLS estimator an 
alternative better estimate is given in braces (and is explained in the next section). The 
standard error estimates presented here use numerical rather than analytical derivatives 
in computing $\hat{A}$ and $\hat{B}$. 

Table 5.7. Exponential Example: Least-Squares and ML Estimates^a 

                                  Estimator 
Variable    OLS         ML          NLS         WNLS        FGNLS 
Constant    −0.0093     1.9829      1.8876      1.9906      1.9840 
            (0.0161)    (0.0141)    (0.0307)    (0.0225)    (0.0148) 
            [0.0172]    [0.0144]    [0.1421]    [0.0359]    [0.0146] 
                                    {0.2110} 
x           0.6198      −0.9896     −0.9575     −0.9961     −0.9907 
            (0.0113)    (0.0099)    (0.0097)    (0.0098)    (0.0100) 
            [0.0254]    [0.0099]    [0.0612]    [0.0224]    [0.0101] 
                                    {0.0880} 
ln L        -           −208.71     −232.98     −208.93     −208.72 
R²          0.2326      0.3906      0.3913      0.3902      0.3906 

^a All estimators are consistent, aside from OLS. Up to three alternative standard error estimates are given: 
nonrobust in parentheses, robust outer product in square brackets, and an alternative robust estimate for NLS 
in braces. The conditional dgp is an exponential distribution with intercept 2 and slope parameter −1. Sample 
size N = 10,000. 


5.9.3. Comparison of Estimates and Standard Errors 


The OLS estimator is inconsistent, yielding estimates unrelated to $(\beta_1, \beta_2)$ in the 
exponential dgp. 

The remaining estimators are consistent, and the ML, NLS, WNLS, and FGNLS 
estimators are within two standard errors of the true parameter values of (2, —1), where 
the robust standard errors need to be used for NLS. The FGNLS estimates are quite 
close to the ML estimates, a consequence of using a dgp in the LEF. 

For the MLE the nonrobust and robust ML standard errors are quite similar. This is 
expected as they are asymptotically equivalent (since the information matrix equality 
holds if the MLE is based on the true density) and the sample size here is large. 

For NLS the nonrobust standard errors are invalid, because the dgp has heteroskedastic 
errors, and greatly overstate the precision of the NLS estimates. The formula 
for the robust variance matrix estimate for NLS is given in (5.81), where $\hat{\Omega} = \operatorname{Diag}[\hat{u}_i^2]$. 
An alternative that uses $\hat{\Omega} = \operatorname{Diag}[\hat{E}[u_i^2]]$, where $\hat{E}[u_i^2] = [\exp(-x_i'\hat{\beta})]^2$, 
is given in braces. The two estimates differ: 0.0612 compared to 0.0880 for the 
slope coefficient. The difference arises because $\hat{u}_i^2 = (y_i - \exp(-x_i'\hat{\beta}))^2$ differs from 
$[\exp(-x_i'\hat{\beta})]^2$. More generally standard errors estimated using the outer product (see 
Section 5.5.2) can be biased even in quite large samples. NLS is considerably less efficient 
than MLE, with standard errors many times those of the MLE using the preferred 
estimates in braces. 

The WNLS estimator does not use the correct model for heteroskedasticity, so the 
nonrobust and robust standard errors again differ. Using the robust standard errors the 
WNLS estimator is more efficient than NLS and less efficient than the MLE. 

In this example the FGNLS estimator is as efficient as the MLE, a consequence 
of the known dgp being in the LEF. The results indicate this, with coefficients and 
standard errors very close to those for the MLE. The robust and nonrobust standard 
errors for the FGNLS estimator are essentially the same, as expected since here the 
model for heteroskedasticity is correctly specified. 

Table 5.7 also reports the estimated log-likelihood, $\ln L = \sum_i [x_i'\hat{\beta} - \exp(x_i'\hat{\beta})y_i]$, 
and an R-squared measure, $R^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2$, 
where $\hat{y}_i = \exp(-x_i'\hat{\beta})$, evaluated at the ML, NLS, WNLS, and FGNLS estimates. 
The $R^2$ differs little across models and is highest for the NLS estimator, as expected 
since NLS minimizes $\sum_i (y_i - \hat{y}_i)^2$. The log-likelihood is maximized by the MLE, as 
expected, and is considerably lower for the NLS estimator. 


5.9.4. Coefficient Interpretation 


Interest lies in changes in E[y|x] when x changes. We consider the ML estimate 
$\hat{\beta}_2 = -0.99$ given in Table 5.7. 

The conditional mean $\exp(-\beta_1 - \beta_2 x)$ is of single-index form, so that if an additional 
regressor z with coefficient $\beta_3$ were included, then the marginal effect of a 
one-unit change in z would be $\beta_3/\beta_2$ times that of a one-unit change in x (see Section 
5.2.4). 

The conditional mean is monotonically decreasing in the index $x'\beta$, so the sign of $\hat{\beta}_2$ is the 
reverse of the sign of the marginal effect (see Section 5.2.4). Here the marginal effect of an increase 
in x is an increase in the conditional mean, since $\hat{\beta}_2$ is negative. 

We now consider the magnitude of the marginal effect of changes in x using calculus 
methods. Here $dE[y|x]/dx = -\beta_2 \exp(-x'\beta)$ varies with the evaluation point 
x and ranges from 0.01 to 19.09 in the sample. The sample-average response is 
$0.99\,N^{-1}\sum_i \exp(-x_i'\hat{\beta}) = 0.61$. The response evaluated at the sample mean of x, 
$0.99\exp(-\bar{x}'\hat{\beta}) = 0.37$, is considerably smaller. Since $dE[y|x]/dx = -\beta_2 E[y|x]$, yet 
another estimate of the marginal effect is $0.99\,\bar{y} = 0.61$. 

Finite-difference methods lead to a different estimated marginal effect. For Δx = 1 
we obtain $\Delta E[y|x] = (e^{-\beta_2} - 1)\exp(-x'\beta)$ (see Section 5.2.4). This yields an average 
response over the sample of 1.04, rather than 0.61. The finite-difference and calculus 
methods coincide, however, if Δx is small. 

The preceding marginal effects are additive. For the exponential conditional mean 
we can also consider multiplicative or proportionate marginal effects (see Section 
5.2.4). For example, a 0.1-unit change in x is predicted to lead to a proportionate 
increase in E[y|x] of 0.1 × 0.99 or a 9.9% increase. Again a finite-difference approach 
will yield a different estimate. 
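
The calculations in this section can be reproduced approximately with a few lines of postestimation code. The sketch below uses the ML estimates from Table 5.7 but, as an assumption made only for illustration, replaces the original sample with fresh draws of x from N[1, 1], so the averages will match the reported values only roughly.

```python
import numpy as np

b1, b2 = 1.9829, -0.9896                  # ML estimates from Table 5.7
rng = np.random.default_rng(0)
x = rng.normal(1.0, 1.0, size=10_000)     # stand-in regressor draws, not the book's sample
mean_hat = np.exp(-(b1 + b2 * x))         # fitted E[y|x] = exp(-x'beta)

# Calculus method: dE[y|x]/dx = -b2 * exp(-x'beta)
me_avg = np.mean(-b2 * mean_hat)                      # sample-average response
me_at_mean = -b2 * np.exp(-(b1 + b2 * x.mean()))      # response at the sample mean of x

# Finite-difference method for a one-unit change in x
fd_avg = np.mean((np.exp(-b2) - 1.0) * mean_hat)

# Proportionate effect of a 0.1-unit change: calculus approximation and exact value
prop_calculus = 0.1 * (-b2)
prop_exact = np.exp(-b2 * 0.1) - 1.0
```

The ratio `fd_avg / me_avg` equals (exp(0.9896) − 1)/0.9896, about 1.71, whatever the sample, which accounts for the gap between the finite-difference figure of 1.04 and the calculus figure of 0.61.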
Which of these measures is most useful? The restriction to single-index form is 
very useful as the relative impact of regressors can be immediately calculated. For the 
magnitude of the response it is most accurate to compute the average response across 
the sample, using noncalculus methods, of a c-unit change in the regressor, where 
the magnitude of c is a meaningful amount such as a one standard deviation change 
in x. 

Similar calculations can be done for the NLS, WNLS, and FGNLS estimates, with 
similar results. For the OLS estimator, note that the coefficient of x can be interpreted 
as giving the sample-average marginal effect of a change in x (see Section 4.7.2). Here 
the OLS estimate $\hat{\beta}_2 = 0.61$ equals to two decimal places the sample-average response 
computed earlier using the exponential MLE. Here OLS provides a good estimate of 
the sample-average marginal response, even though it can provide a very poor estimate 
of the marginal response for any particular value of x. 


5.10. Practical Considerations 


Most econometrics packages provide simple commands to obtain the maximum like- 
lihood estimators for the standard models introduced in Section 5.6.1. For other den- 
sities many packages provide an ML routine to which the user provides the equation 
for the density and possibly first derivatives or even second derivatives. Similarly, for 
NLS one provides the equation for the conditional mean to an NLS routine. For some 
nonlinear models and data sets the ML and NLS routines provided in packages can en- 
counter computational difficulties in obtaining estimates. In such circumstances it may 
be necessary to use more robust optimization routines provided as add-on modules to 
Gauss, Matlab and OX. Gauss, Matlab and OX are better tools for nonlinear modeling, 
but require a higher initial learning investment. 

For cross-section data it is becoming standard to use standard errors based on the 
sandwich form of the variance matrix. These are often provided as a command option. 
For LS estimators this gives heteroskedastic-consistent standard errors. For maximum 
likelihood one should be aware that misspecification of the density can lead to incon- 
sistency in addition to requiring the use of sandwich errors. 

The parameters of nonlinear models are usually not directly interpretable, and it is 
good practice to additionally compute the implied marginal effects caused by changes 
in regressors (see Section 5.2.4). Some packages do this automatically; for others sev- 
eral lines of postestimation code using saved regression coefficients may be needed. 


5.11. Bibliographic Notes 


A brief history of the development of asymptotic theory results for extremum estimators is 
given in Newey and McFadden (1994, p. 2115). A major econometrics advance was made 
by Amemiya (1973), who developed quite general theorems that were applied to the Tobit 
model MLE. Useful book-length treatments include those by Gallant (1987), Gallant and White 
(1987), Bierens (1993), and White (1994, 2001a). Statistical foundations are given in many 
books, including Amemiya (1985, Chapter 3), Davidson and MacKinnon (1993, Chapter 4), 
Greene (2003, appendix D), Davidson (1994), and Zaman (1996). 

5.3 The presentation of general extremum estimation results draws heavily on Amemiya (1985, 
Chapter 4), and to a lesser extent on Newey and McFadden (1994). The latter reference is 
very comprehensive. 

5.4 The estimating equations approach is used in the generalized linear models literature (see 
McCullagh and Nelder, 1989). Econometricians subsume this in generalized method of 
moments (see Chapter 6). 

5.5 Statistical inference is presented in detail in Chapter 7. 

5.6 See the pioneering article by Fisher (1922) for general results for ML estimation, including 
efficiency, and for comparison of the likelihood approach with the inverse-probability or 
Bayesian approach and with method of moments estimation. 

5.7 Modern applications frequently use the quasi-ML framework and sandwich estimates of 
the variance matrix (see White, 1982, 1994). In statistics the approach is called generalized 
linear models, with McCullagh and Nelder (1989) a standard reference. 

5.8 Similarly for NLS estimation, sandwich estimates of the variance matrix are used that require 
relatively weak assumptions on the error process. The papers by White (1980a,c) had 
a big impact on statistical inference in econometrics. Generalization and a detailed review 
of the asymptotic theory is given in White and Domowitz (1984). Amemiya (1983) has 
extensively surveyed methods for nonlinear regression. 


Exercises 


5-1 Suppose we obtain model estimates that yield predicted conditional mean 
E[y|x] = exp(1 + 0.01x)/[1 + exp(1 + 0.01x)]. Suppose the sample is of size 100 
and x takes integer values 1, 2, ..., 100. Obtain the following estimates of the 
estimated marginal effect dE[y|x]/dx. 

(a) The average marginal effect over all observations. 

(b) The marginal effect of the average observation. 

(c) The marginal effect when x = 90. 

(d) The marginal effect of a one-unit change when x = 90, computed using the 
finite-difference method. 


5-2 Consider the following special one-parameter case of the gamma distribution, 
$f(y) = (y/\lambda^2)\exp(-y/\lambda)$, y > 0, λ > 0. For this distribution it can be shown that 
$E[y] = 2\lambda$ and $V[y] = 2\lambda^2$. Here we introduce regressors and suppose that in the 
true model the parameter λ depends on regressors according to $\lambda_i = \exp(x_i'\beta)/2$. 
Thus $E[y_i|x_i] = \exp(x_i'\beta)$ and $V[y_i|x_i] = [\exp(x_i'\beta)]^2/2$. Assume the data are 
independent over i and $x_i$ is nonstochastic and $\beta = \beta_0$ in the dgp. 

(a) Show that the log-likelihood function (scaled by $N^{-1}$) for this gamma model 
is $Q_N(\beta) = N^{-1}\sum_i \{\ln y_i - 2x_i'\beta + 2\ln 2 - 2y_i\exp(-x_i'\beta)\}$. 

(b) Obtain plim $Q_N(\beta)$. You can assume that assumptions for any LLN used are 
satisfied. [Hint: $E[\ln y_i]$ depends on $\beta_0$ but not $\beta$.] 

(c) Prove that $\hat{\beta}$ that is the local maximum of $Q_N(\beta)$ is consistent for $\beta_0$. State 
any assumptions made. 

(d) Now state what LLN you would use to verify part (b) and what additional 
information, if any, is needed to apply this law. A brief answer will do. There 
is no need for a formal proof. 
5-3 Continue with the gamma model of Exercise 5-2. 

(a) Show that $\partial Q_N(\beta)/\partial\beta = N^{-1}\sum_i 2[(y_i - \exp(x_i'\beta))/\exp(x_i'\beta)]x_i$. 

(b) What essential condition indicated by the first-order conditions needs to be 
satisfied for $\hat{\beta}$ to be consistent? 

(c) Apply a central limit theorem to obtain the limit distribution of $\sqrt{N}\,\partial Q_N/\partial\beta|_{\beta_0}$. 
Here you can assume that the assumptions necessary for a CLT are satisfied. 

(d) State what CLT you would use to verify part (c) and what additional information, 
if any, is needed to apply this law. A brief answer will do. There is no 
need for a formal proof. 

(e) Obtain the probability limit of $\partial^2 Q_N/\partial\beta\,\partial\beta'|_{\beta_0}$. 

(f) Combine the previous results to obtain the limit distribution of $\sqrt{N}(\hat{\beta} - \beta_0)$. 

(g) Given part (f), state how to test $H_0: \beta_{0j} = \beta_j^{*}$ against $H_a: \beta_{0j} < \beta_j^{*}$ at level 
0.05, where $\beta_j$ is the jth component of $\beta$. 

5-4 A nonnegative integer variable y that is geometric distributed has density (or 
more formally probability mass function) $f(y) = (y + 1)(2\lambda)^{y}(1 + 2\lambda)^{-(y+0.5)}$, 
y = 0, 1, 2, ..., λ > 0. Then $E[y] = \lambda$ and $V[y] = \lambda(1 + 2\lambda)$. Introduce regressors and 
suppose $\lambda_i = \exp(x_i'\beta)$. Assume the data are independent over i and $x_i$ is 
nonstochastic and $\beta = \beta_0$ in the dgp. 

(a) Repeat Exercise 5-2 for this model. 

(b) Repeat Exercise 5-3 for this model. 

5-5 Suppose a sample yields estimates $\hat{\theta}_1 = 5$, $\hat{\theta}_2 = 3$, $se[\hat{\theta}_1] = 2$, and $se[\hat{\theta}_2] = 1$ and 
the correlation coefficient between $\hat{\theta}_1$ and $\hat{\theta}_2$ equals 0.5. Perform the following 
tests at level 0.05, assuming asymptotic normality of the parameter estimates. 

(a) Test $H_0: \theta_1 = 0$ against $H_a: \theta_1 \neq 0$. 

(b) Test $H_0: \theta_1 = 2\theta_2$ against $H_a: \theta_1 \neq 2\theta_2$. 

(c) Test $H_0: \theta_1 = 0, \theta_2 = 0$ against $H_a$: at least one of $\theta_1, \theta_2 \neq 0$. 

5-6 Consider the nonlinear regression model $y = \exp(x'\beta)/[1 + \exp(x'\beta)] + u$, where 
the error term is possibly heteroskedastic. 

(a) Within what range does this restrict E[y|x] to lie? 

(b) Give the first-order conditions for the NLS estimator. 

(c) Obtain the asymptotic distribution of the NLS estimator using result (5.77). 


5-7 This question presumes access to software that allows NLS and ML estimation. 
Consider the gamma regression model of Exercise 5-2. An appropriate gamma 
variate can be generated using $y = -\lambda\ln r_1 - \lambda\ln r_2$, where $\lambda = \exp(x'\beta)/2$ and 
$r_1$ and $r_2$ are random draws from Uniform[0, 1]. Let $x'\beta = \beta_1 + \beta_2 x$. Generate a 
sample of size 1,000 when $\beta_1 = -1.0$ and $\beta_2 = 1$ and $x \sim \mathcal{N}[0, 1]$. 

(a) Obtain estimates of $\beta_1$ and $\beta_2$ from NLS regression of y on $\exp(\beta_1 + \beta_2 x)$. 
(b) Should sandwich standard errors be used here? 

(c) Obtain ML estimates of $\beta_1$ and $\beta_2$ from NLS regression of y on $\exp(\beta_1 + \beta_2 x)$. 
(d) Should sandwich standard errors be used here? 
CHAPTER 6


Generalized Method of Moments 
and Systems Estimation 


6.1. Introduction 


The previous chapter focused on m-estimation, including ML and NLS estimation. 
Now we consider a much broader class of extremum estimators, those based on method 
of moments (MM) and generalized method of moments (GMM). 

The basis of MM and GMM is specification of a set of population moment condi- 
tions involving data and unknown parameters. The MM estimator solves the sample 
moment conditions that correspond to the population moment conditions. For exam- 
ple, the sample mean is the MM estimator of the population mean. In some cases there 
may be no explicit analytical solution for the MM estimator, but numerical solution 
may still be possible. Then the estimator is an example of the estimating equations 
estimator introduced briefly in Section 5.4. 

In some situations, however, MM estimation may be infeasible because there are 
more moment conditions and hence equations to solve than there are parameters. A 
leading example is IV estimation in an overidentified model. The GMM estimator, due 
to Hansen (1982), extends the MM approach to accommodate this case. 

The GMM estimator defines a class of estimators, with different GMM estimators 
obtained by using different population moment conditions, just as different specified 
densities lead to different ML estimators. We emphasize this moment-based approach 
to estimation, even in cases where alternative presentations are possible, as it provides 
a unified approach to estimation and can provide an obvious way to extend methods 
from linear to nonlinear models. 

The basics of GMM estimation are given in Sections 6.2 and 6.3, which present, 
respectively, expository examples and asymptotic results for statistical inference. The 
remainder of the chapter details more specialized estimators. Instrumental variables 
estimators are presented in Sections 6.4 and 6.5. For linear models the treatment in 
Sections 4.8 and 4.9 may be sufficient, but extension to nonlinear models uses the 
GMM approach. Section 6.6 covers methods to compute standard errors of sequential 
two-step m-estimators. Sections 6.7 and 6.8 present the minimum distance estimator, 
a variant of GMM, and the empirical likelihood estimator, an alternative estimator to 
GMM. Systems estimation methods, used in a relatively small fraction of
microeconometrics studies, are discussed in Sections 6.9 and 6.10.

This chapter reviews many estimation methods from a GMM perspective. Applica- 
tions of these methods to actual data include a linear IV application in Section 4.9.6 
and a linear panel GMM application in Section 22.3. 


6.2. Examples 


GMM estimators are based on the analogy principle (see Section 5.4.2) that population 
moment conditions lead to sample moment conditions that can be used to estimate 
parameters. This section provides several leading applications of this principle, with 
properties of the resulting estimator deferred to Section 6.3. 


6.2.1. Linear Regression 


A classic example of method of moments is estimation of the population mean when
y is iid with mean $\mu$. In the population

$$E[y - \mu] = 0.$$

Replacing the expectations operator $E[\cdot]$ for the population by the average operator
$N^{-1}\sum_{i=1}^{N}(\cdot)$ for the sample yields the corresponding sample moment

$$\frac{1}{N}\sum_{i=1}^{N}(y_i - \mu) = 0.$$

Solving for $\mu$ leads to the estimator $\hat{\mu}_{MM} = N^{-1}\sum_i y_i = \bar{y}$. The MM estimate of the
population mean is the sample mean.

This approach can be extended to the linear regression model $y = x'\beta + u$, where
x and $\beta$ are K × 1 vectors. Suppose the error term u has zero mean conditional on
regressors. The single conditional moment restriction E[u|x] = 0 leads to K
unconditional moment conditions E[xu] = 0, since

$$E[xu] = E_x[E[xu|x]] = E_x[xE[u|x]] = E_x[x \cdot 0] = 0, \tag{6.1}$$

using the law of iterated expectations (see Section A.8) and the assumption that
E[u|x] = 0. Thus

$$E[x(y - x'\beta)] = 0,$$

if the error has conditional mean zero. The MM estimator is the solution to the
corresponding sample moment condition

$$\frac{1}{N}\sum_{i=1}^{N} x_i(y_i - x_i'\beta) = 0.$$

This yields $\hat{\beta}_{MM} = (\sum_i x_i x_i')^{-1}\sum_i x_i y_i$.

The OLS estimator is therefore a special case of MM estimation. The MM
derivation of the OLS estimator, however, differs significantly from the usual one of
minimization of a sum of squared residuals.
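
The equivalence of the two derivations can be checked numerically. The following is a minimal sketch in Python with NumPy, using simulated data; the sample size, coefficient values, and variable names are illustrative assumptions and do not come from the text. It solves the K sample moment conditions directly and compares the result with a least-squares routine.

```python
import numpy as np

# Simulated data (illustrative assumption): intercept plus two regressors.
rng = np.random.default_rng(0)
N, K = 500, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(size=N)

# MM estimator: solve (1/N) sum_i x_i (y_i - x_i'b) = 0 for b,
# that is, (sum_i x_i x_i') b = sum_i x_i y_i.
beta_mm = np.linalg.solve(X.T @ X, X.T @ y)

# The sample moments evaluated at the MM estimate are zero up to rounding error.
sample_moments = X.T @ (y - X @ beta_mm) / N

# Minimizing the sum of squared residuals gives the same estimate.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both routes give the same coefficient vector, although the first never refers to a sum of squared residuals.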


6.2.2. Nonlinear Regression 


For nonlinear regression the method of moments approach reduces to NLS if regres- 
sion errors are additive. For more general nonlinear regression with nonadditive errors 
(defined in the following) method of moments yields a consistent estimator whereas 
NLS is inconsistent. 

From Section 5.8.3 the nonlinear regression model with additive error is a model 
that specifies 


$$y = g(x, \beta) + u.$$

A moment approach similar to that for the linear model yields that E[u|x] = 0 implies
that $E[h(x)(y - g(x, \beta))] = 0$, where h(x) is any function of x. The particular choice
$h(x) = \partial g(x, \beta)/\partial\beta$, motivated in Section 6.3.7, leads to a corresponding sample
moment condition that equals the first-order conditions for the NLS estimator given in
Section 5.8.2.

The more general nonlinear regression model with nonadditive error specifies 


$$u = r(y, x, \beta),$$

where again E[u|x] = 0 but now y is no longer restricted to being an additive function
of u. For example, in Poisson regression one may define the standardized error
$u = [y - \exp(x'\beta)]/[\exp(x'\beta)]^{1/2}$ that has E[u|x] = 0 and V[u|x] = 1 since y has
conditional mean and variance equal to $\exp(x'\beta)$.

The NLS estimator is inconsistent given nonadditive error. Minimizing
$N^{-1}\sum_i u_i^2 = N^{-1}\sum_i r(y_i, x_i, \beta)^2$ leads to first-order conditions

$$\frac{1}{N}\sum_{i=1}^{N}\frac{\partial r(y_i, x_i, \beta)}{\partial\beta}\, r(y_i, x_i, \beta) = 0.$$

Here $y_i$ appears in both terms in the product and there is no guarantee that this
product has expected value of zero even if E[r(·)|x] = 0. This inconsistency did not arise
with additive errors $r(\cdot) = y - g(x, \beta)$, as then $\partial r(\cdot)/\partial\beta = -\partial g(x, \beta)/\partial\beta$, so only
the second term in the product depended on y.

A moment-based approach yields a consistent estimator. The assumption that
E[u|x] = 0 implies

$$E[h(x)\, r(y, x, \beta)] = 0,$$

where h(x) is a function of x. If dim[h(x)] = K then the corresponding sample
moment

$$\frac{1}{N}\sum_{i=1}^{N} h(x_i)\, r(y_i, x_i, \beta) = 0$$

yields a consistent estimate of $\beta$, where solution is by numerical methods.
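
As an illustration of the numerical solution, the sketch below applies this moment-based approach to the standardized Poisson error defined above. It assumes the choice h(x) = x and uses simulated data and Newton iterations. The data-generating values, starting value, and function names are illustrative assumptions, not part of the text.

```python
import numpy as np

# Simulated Poisson data (illustrative assumption).
rng = np.random.default_rng(1)
N = 2000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([0.5, 0.3])
y = rng.poisson(np.exp(X @ beta_true))

def sample_moment(b):
    """(1/N) sum_i h(x_i) r(y_i, x_i, b) with h(x) = x and standardized error r."""
    mu = np.exp(X @ b)
    r = (y - mu) / np.sqrt(mu)
    return X.T @ r / N

def jacobian(b):
    """Derivative of the sample moment with respect to b'.

    For r = (y - mu) / sqrt(mu) with mu = exp(x'b),
    dr/db' = -(y + mu) / (2 sqrt(mu)) * x'.
    """
    mu = np.exp(X @ b)
    dr = -(y + mu) / (2.0 * np.sqrt(mu))
    return (X * dr[:, None]).T @ X / N

# Newton iterations on the K = 2 sample moment conditions.
b = np.array([np.log(y.mean()), 0.0])
for _ in range(50):
    step = np.linalg.solve(jacobian(b), sample_moment(b))
    b = b - step
    if np.max(np.abs(step)) < 1e-12:
        break
beta_mm = b
```

Other choices of h(x) with K components would give other consistent estimators. The choice h(x) = x here is made only for simplicity.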
6.2.3. Maximum Likelihood


The Kullback-Leibler information criterion was defined in Section 5.7.2. From
this definition, a local maximum of KLIC occurs if $E[s(\theta)] = 0$, where
$s(\theta) = \partial \ln f(y|x, \theta)/\partial\theta$ and $f(y|x, \theta)$ is the conditional density.

Replacing population moments by sample moments yields an estimator $\hat{\theta}$ that
solves $N^{-1}\sum_i s_i(\theta) = 0$. These are the ML first-order conditions, so the MLE can
be motivated as an MM estimator.


6.2.4. Additional Moment Restrictions 


Using additional moments can improve the efficiency of estimation but requires adap- 
tation of regular method of moments if there are more moment conditions than param- 
eters to estimate. 

A simple example of an inefficient estimator is the sample mean. This is an ineffi- 
cient estimator of the population mean unless the data are a random sample from the 
normal distribution or some other member of the exponential family of distributions. 
One way to improve efficiency is to use alternative estimators. The sample median, 
consistent for $\mu$ if the distribution is symmetric, may be more efficient. Obviously the
MLE could be used if the distribution is fully specified, but here we instead improve 
efficiency by using additional moment restrictions. 

Consider estimation of $\beta$ in the linear regression model. The OLS estimator is
inefficient even assuming homoskedastic errors, unless errors are normally distributed.
From Section 6.2.1, the OLS estimator is an MM estimator based on E[xu] = 0. Now
make the additional moment assumption that errors are conditionally symmetric, so
that $E[u^3|x] = 0$ and hence $E[xu^3] = 0$. Then estimation of $\beta$ may be based on the
2K moment conditions

$$\begin{bmatrix} E[x(y - x'\beta)] \\ E[x(y - x'\beta)^3] \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}.$$

The MM estimator would attempt to estimate $\beta$ as the solution to the corresponding
sample moment conditions $N^{-1}\sum_i x_i(y_i - x_i'\beta) = 0$ and $N^{-1}\sum_i x_i(y_i - x_i'\beta)^3 = 0$.
However, with 2K equations and only K unknown parameters $\beta$, it is not possible for
all of these sample moment conditions to be satisfied.

The GMM estimator instead sets the sample moments as close to zero as possible
using quadratic loss. Then $\hat{\beta}_{GMM}$ minimizes

$$Q_N(\beta) = \begin{bmatrix} \frac{1}{N}\sum_i x_i u_i \\ \frac{1}{N}\sum_i x_i u_i^3 \end{bmatrix}' W_N \begin{bmatrix} \frac{1}{N}\sum_i x_i u_i \\ \frac{1}{N}\sum_i x_i u_i^3 \end{bmatrix}, \tag{6.2}$$

where $u_i = y_i - x_i'\beta$ and $W_N$ is a 2K × 2K weighting matrix. For some choices
of $W_N$ this estimator is more efficient than OLS. This example is analyzed in
Section 6.3.6.
6.2.5. Instrumental Variables Regression


Instrumental variables estimation is a leading example of generalized method of mo- 
ments estimation. 

Consider the linear regression model $y = x'\beta + u$, with the complication that some
components of x are correlated with the error term so that OLS is inconsistent for $\beta$.
Assume the existence of instruments z (introduced in Section 4.8) that are correlated
with x but satisfy E[u|z] = 0. Then $E[y - x'\beta|z] = 0$. Using algebra similar to that
used to obtain (6.1) for the OLS example, we multiply by z to get the K unconditional
population moment conditions

$$E[z(y - x'\beta)] = 0. \tag{6.3}$$

The method of moments estimator solves the corresponding sample moment condition

$$\frac{1}{N}\sum_{i=1}^{N} z_i(y_i - x_i'\beta) = 0.$$

If dim(z) = K this yields $\hat{\beta}_{MM} = (\sum_i z_i x_i')^{-1}\sum_i z_i y_i$, which is the linear IV estimator
introduced in Section 4.8.6.

No unique solution exists if there are more potential instruments than regressors,
since then dim(z) > K and there are more equations than unknowns. One possibility
is to use just K instruments, but there is then an efficiency loss. The GMM estimator
instead chooses $\hat{\beta}$ to make the vector $N^{-1}\sum_i z_i(y_i - x_i'\hat{\beta})$ as small as possible using
quadratic loss, so that $\hat{\beta}_{GMM}$ minimizes

$$Q_N(\beta) = \left[\frac{1}{N}\sum_{i=1}^{N} z_i(y_i - x_i'\beta)\right]' W_N \left[\frac{1}{N}\sum_{i=1}^{N} z_i(y_i - x_i'\beta)\right], \tag{6.4}$$

where $W_N$ is a dim(z) × dim(z) weighting matrix. The 2SLS estimator (see
Section 4.8.6) corresponds to a particular choice of $W_N$.

Instrumental variables methods for linear models are presented in considerable
detail in Section 6.4. An advantage of the GMM approach is that it provides a way to
specify the optimal choice of weighting matrix $W_N$, leading to an estimator more
efficient than 2SLS.

Section 6.5 covers IV methods for nonlinear models. One advantage of the GMM
approach is that generalization to nonlinear regression is straightforward. Then we
simply replace $y - x'\beta$ in the preceding expression for $Q_N(\beta)$ by the nonlinear model
error $u = y - g(x, \beta)$ or $u = r(y, x, \beta)$.
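
For the linear model, the minimizer of (6.4) has a closed form, because the first-order conditions are linear in $\beta$. The following sketch, on simulated data with illustrative parameter values, computes it for two weighting matrices. It uses the standard result, assumed here rather than derived in this section, that 2SLS is obtained when $W_N$ is proportional to $(\sum_i z_i z_i')^{-1}$. The function name `linear_gmm` is hypothetical.

```python
import numpy as np

# Simulated overidentified model (illustrative assumption):
# one endogenous regressor, three excluded instruments, plus an intercept.
rng = np.random.default_rng(2)
N = 1000
Z = np.column_stack([np.ones(N), rng.normal(size=(N, 3))])
v = rng.normal(size=N)
u = 0.8 * v + rng.normal(size=N)                 # error correlated with x through v
x1 = Z[:, 1:] @ np.array([1.0, 0.5, 0.5]) + v
X = np.column_stack([np.ones(N), x1])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + u

def linear_gmm(y, X, Z, W):
    """Minimizer of [Z'(y - Xb)/N]' W [Z'(y - Xb)/N], from X'Z W Z'(y - Xb) = 0."""
    XZ = X.T @ Z
    return np.linalg.solve(XZ @ W @ XZ.T, XZ @ W @ (Z.T @ y))

beta_gmm_2sls_weight = linear_gmm(y, X, Z, np.linalg.inv(Z.T @ Z))
beta_gmm_identity = linear_gmm(y, X, Z, np.eye(Z.shape[1]))

# 2SLS computed in two stages: fitted values of X from Z, then IV with X_hat.
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
beta_2sls = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)

# OLS for comparison; it is inconsistent here because x1 is correlated with u.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
```

The two weighting matrices give different estimates in the overidentified case, and the GMM estimate with the first weighting matrix coincides with 2SLS.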


6.2.6. Panel Data 


Another leading application of GMM and related estimation methods is to panel data 
regression. 

As an example, suppose $y_{it} = x_{it}'\beta + u_{it}$, where i denotes individual and t denotes
time. From Section 6.2.1, pooled OLS regression of $y_{it}$ on $x_{it}$ is an MM estimator
based on the condition $E[x_{it}u_{it}] = 0$. Suppose it is additionally assumed that the
error $u_{it}$ is uncorrelated with regressors in periods other than the current period. Then
$E[x_{is}u_{it}] = 0$ for $s \neq t$ provides additional moment conditions that can be used to
obtain more efficient estimators.

Chapters 22 and 23 provide many applications of GMM methods to panel data.


6.2.7. Moment Conditions from Economic Theory 


Economic theory can generate moment conditions that can be used as the basis for 
estimation. 
Begin with the model 


$$y_t = E[y_t|x_t, \beta] + u_t,$$

where the first term on the right-hand side measures the “anticipated” component of
y conditional on x and the second component measures the “unanticipated” component.
As examples, y may denote return on an asset or the rate of inflation. Under the
twin assumptions of rational expectations and market clearing or market efficiency,
we may obtain the result that the unanticipated component is unpredictable using any
information that was available at time t for determining E[y|x]. Then

$$E[(y_t - E[y_t|x_t, \beta])\,|\,\mathcal{I}_t] = 0,$$

where $\mathcal{I}_t$ denotes information available at time t.

By the law of iterated expectations, $E[z_t(y_t - E[y_t|x_t, \beta])] = 0$, where $z_t$ is formed
from any subset of $\mathcal{I}_t$. Since any part of the information set can be used as an
instrument, this provides many moment conditions that can be the basis of estimation. If
time-series data are available then GMM minimizes the quadratic form

$$Q_T(\beta) = \left[\frac{1}{T}\sum_{t=1}^{T} z_t u_t\right]' W_T \left[\frac{1}{T}\sum_{t=1}^{T} z_t u_t\right],$$

where $u_t = y_t - E[y_t|x_t, \beta]$. If cross-section data are available at a single time point t
then GMM minimizes the quadratic form

$$Q_N(\beta) = \left[\frac{1}{N}\sum_{i=1}^{N} z_i u_i\right]' W_N \left[\frac{1}{N}\sum_{i=1}^{N} z_i u_i\right],$$

where $u_i = y_i - E[y_i|x_i, \beta]$ and the subscript t can be dropped as only one time period
is analyzed.

This approach is not restricted to the additive structure used in motivation. All
that is needed is an error $u_t$ with the property that $E[u_t|\mathcal{I}_t] = 0$. Such conditions
arise from the Euler conditions from intertemporal models of decision making under
certainty. For example, Hansen and Singleton (1982) present a model of
maximization of expected lifetime utility that leads to the Euler condition $E[u_t|\mathcal{I}_t] = 0$,
where $u_t = \beta g_{t+1}^{\alpha} r_{t+1} - 1$, $g_{t+1} = c_{t+1}/c_t$ is the ratio of consumption in two periods,
and $r_{t+1}$ is asset return. The parameters $\beta$ and $\alpha$, the intertemporal discount rate and
the coefficient of relative risk aversion, respectively, can be estimated by GMM using
either time-series or cross-section data as was done previously, with this new
definition of $u_t$. Hansen (1982) and Hansen and Singleton (1982) consider time-series data;
MaCurdy (1983) modeled both consumption and labor supply using panel data.
Table 6.1. Generalized Method of Moments: Examples

Moment Function h(·)                              Estimation Method
$y - \mu$                                         Method of moments for population mean
$x(y - x'\beta)$                                  Ordinary least-squares regression
$z(y - x'\beta)$                                  Instrumental variables regression
$\partial \ln f(y|x, \theta)/\partial\theta$      Maximum likelihood estimation


6.3. Generalized Method of Moments 


This section presents the general theory of GMM estimation. Generalized method of
moments defines a class of estimators. Different choices of moment condition and
weighting matrix lead to different GMM estimators, just as different choices of
distribution lead to different ML estimators. We address these issues, in addition to
presenting the usual properties of consistency and asymptotic normality and methods to
estimate the variance matrix of the GMM estimator.


6.3.1. Method of Moments Estimator 
The starting point is to assume the existence of r moment conditions for q parameters,

$$E[h(w_i, \theta_0)] = 0, \tag{6.5}$$

where $\theta$ is a q × 1 vector, h(·) is an r × 1 vector function with $r \geq q$, and $\theta_0$ denotes
the value of @ in the dgp. The vector w includes all observables including, where 
relevant, a dependent variable y, potentially endogenous regressors x, and instrumental 
variables z. The dependent variable y may be a vector, so that applications with systems 
of equations or with panel data are subsumed. The expectation is with respect to all 
stochastic components of w and hence y, x, and z. 

The choice of functional form for h(·) is qualitatively similar to the choice of model
and will vary with application. Table 6.1 summarizes some single-equation examples
of $h(w) = h(y, x, z, \theta)$ already presented in Section 6.2.

If r = q then method of moments can be applied. Equality to zero of the population
moment is replaced by equality to zero of the corresponding sample moment, and the
method of moments estimator $\hat{\theta}_{MM}$ is defined to be the solution to

$$\frac{1}{N}\sum_{i=1}^{N} h(w_i, \hat{\theta}) = 0. \tag{6.6}$$

This is an estimating equations estimator that equivalently minimizes

$$Q_N(\theta) = \left[\frac{1}{N}\sum_{i=1}^{N} h(w_i, \theta)\right]' \left[\frac{1}{N}\sum_{i=1}^{N} h(w_i, \theta)\right],$$

with asymptotic distribution presented in Section 5.4 and reproduced in (6.13) in
Section 6.3.3.
6.3.2. GMM Estimator


The GMM estimator is based on r independent moment conditions (6.5) while q pa- 
rameters are estimated. 

If r = q the model is said to be just-identified and the MM estimator in (6.6) can be
used. More formally r = q is only a necessary condition for just-identification and we
additionally require that $G_0$ in Proposition 5.1 is of rank q. Identification is addressed
in Section 6.3.9.

If r > q the model is said to be overidentified and (6.6) has no solution for $\hat{\theta}$ as
there are more equations (r) than unknowns (q). Instead, $\hat{\theta}$ is chosen so that a quadratic
form in $N^{-1}\sum_i h(w_i, \hat{\theta})$ is as close to zero as possible. Specifically, the generalized
methods of moments estimator $\hat{\theta}_{GMM}$ minimizes the objective function

$$Q_N(\theta) = \left[\frac{1}{N}\sum_{i=1}^{N} h(w_i, \theta)\right]' W_N \left[\frac{1}{N}\sum_{i=1}^{N} h(w_i, \theta)\right], \tag{6.7}$$

where the r × r weighting matrix $W_N$ is symmetric positive definite, possibly
stochastic with finite probability limit, and does not depend on $\theta$. The subscript N on $W_N$ is
used to indicate that its value may depend on the sample. The dimension r of $W_N$,
however, is fixed as $N \to \infty$. The objective function can also be expressed in matrix
notation as $Q_N(\theta) = N^{-1}l'H(\theta) \times W_N \times N^{-1}H(\theta)'l$, where $l$ is an N × 1 vector of
ones and $H(\theta)$ is an N × r matrix with ith row $h(y_i, x_i, \theta)'$.

Different choices of weighting matrix $W_N$ lead to different estimators that, although
consistent, have different variances if r > q. A simple choice, though often a poor
choice, is to let $W_N$ be the identity matrix. Then $Q_N(\theta) = \bar{h}_1^2 + \bar{h}_2^2 + \cdots + \bar{h}_r^2$ is the
sum of r squared sample averages, where $\bar{h}_j = N^{-1}\sum_i h_j(w_i, \theta)$ and $h_j(\cdot)$ is the jth
component of h(·). The optimal choice of $W_N$ is given in Section 6.3.5.

Differentiating $Q_N(\theta)$ in (6.7) with respect to $\theta$ yields the GMM first-order
conditions

$$\left[\frac{1}{N}\sum_{i=1}^{N}\frac{\partial h_i(\hat{\theta})'}{\partial\theta}\right] W_N \left[\frac{1}{N}\sum_{i=1}^{N} h_i(\hat{\theta})\right] = 0, \tag{6.8}$$

where $h_i(\theta) = h(w_i, \theta)$ and we have multiplied by the scaling factor 1/2. These
equations will generally be nonlinear in $\hat{\theta}$ and can be quite complicated to solve as $\hat{\theta}$ may
appear in both the first and third terms. Numerical solution methods are presented in
Chapter 10.


6.3.3. Distribution of GMM Estimator 


The asymptotic distribution of the GMM estimator is given in the following proposi- 
tion, derived in Section 6.3.9. 


Proposition 6.1 (Distribution of GMM Estimator): Make the following
assumptions:

(i) The dgp imposes the moment condition (6.5); that is, $E[h(w, \theta_0)] = 0$.
(ii) The r × 1 vector function h(·) satisfies $h(w, \theta^{(1)}) = h(w, \theta^{(2)})$ iff $\theta^{(1)} = \theta^{(2)}$.
(iii) The following r × q matrix exists and is finite with rank q:

$$G_0 = \text{plim}\, \frac{1}{N}\sum_{i=1}^{N} \left.\frac{\partial h_i}{\partial\theta'}\right|_{\theta_0}. \tag{6.9}$$

(iv) $W_N \xrightarrow{p} W_0$, where $W_0$ is finite symmetric positive definite.
(v) $N^{-1/2}\sum_{i=1}^{N} h_i|_{\theta_0} \xrightarrow{d} N[0, S_0]$, where

$$S_0 = \text{plim}\, N^{-1}\sum_{i=1}^{N}\sum_{j=1}^{N}\left[h_i h_j'\big|_{\theta_0}\right]. \tag{6.10}$$

Then the GMM estimator $\hat{\theta}_{GMM}$, defined to be a root of the first-order conditions
$\partial Q_N(\theta)/\partial\theta = 0$ given in (6.8), is consistent for $\theta_0$ and

$$\sqrt{N}(\hat{\theta}_{GMM} - \theta_0) \xrightarrow{d} N\left[0, (G_0'W_0G_0)^{-1}(G_0'W_0S_0W_0G_0)(G_0'W_0G_0)^{-1}\right]. \tag{6.11}$$


Some leading specializations are the following.

First, in microeconometric analysis data are usually assumed to be independent over
i, so (6.10) simplifies to

$$S_0 = \text{plim}\, \frac{1}{N}\sum_{i=1}^{N}\left[h_i h_i'\big|_{\theta_0}\right]. \tag{6.12}$$

If additionally the data are assumed to be identically distributed then (6.9) and
(6.10) simplify to $G_0 = E[\partial h/\partial\theta'|_{\theta_0}]$ and $S_0 = E[hh'|_{\theta_0}]$, a notation used by many
authors.

Second, in the just-identified case that r = q, the situation for many estimators
including ML and LS, the results simplify to those already presented in Section 5.4 for
the estimating equations estimator. To see this note that when r = q the matrices $G_0$,
$W_0$, and $S_0$ are square matrices that are invertible, so $(G_0'W_0G_0)^{-1} = G_0^{-1}W_0^{-1}(G_0')^{-1}$
and the variance matrix in (6.11) simplifies. It follows that, for the MM estimator in
(6.6),

$$\sqrt{N}(\hat{\theta}_{MM} - \theta_0) \xrightarrow{d} N\left[0, G_0^{-1}S_0(G_0')^{-1}\right]. \tag{6.13}$$

An MM estimator can always be computed as a GMM estimator and will be invariant
to the choice of full rank weighting matrix.

Third, the best choice of matrix $W_N$ is one such that $W_0 = S_0^{-1}$. Then the variance
matrix in (6.11) simplifies to $(G_0'S_0^{-1}G_0)^{-1}$. This is expanded on in Section 6.3.5.


6.3.4. Variance Matrix Estimation 


Statistical inference for the GMM estimator is possible given consistent estimates $\hat{G}$
of $G_0$, $\hat{W}$ of $W_0$, and $\hat{S}$ of $S_0$ in (6.11). Consistent estimates are easily obtained under
relatively weak distributional assumptions.

For $G_0$ the obvious estimator is

$$\hat{G} = \frac{1}{N}\sum_{i=1}^{N}\left.\frac{\partial h_i}{\partial\theta'}\right|_{\hat{\theta}}. \tag{6.14}$$

For $W_0$ the sample weighting matrix $W_N$ is used. The estimator for the r × r matrix $S_0$
varies with the stochastic assumptions made about the dgp. Microeconometric analysis
usually assumes independence over i, so that $S_0$ is of the simpler form (6.12). An
obvious estimator is then

$$\hat{S} = \frac{1}{N}\sum_{i=1}^{N} h_i(\hat{\theta})h_i(\hat{\theta})'. \tag{6.15}$$

Since h(·) is r × 1, there are at most a finite number of r(r + 1)/2 unique entries in $S_0$
to be estimated. So $\hat{S}$ is consistent as $N \to \infty$ without need to parameterize the
variance $E[h_i h_i']$, assumed to exist, to depend on fewer parameters. All that is
required are some mild additional assumptions to ensure that $\text{plim}\, N^{-1}\sum_i \hat{h}_i\hat{h}_i' =
\text{plim}\, N^{-1}\sum_i h_i h_i'$. For example, if $\hat{h}_i = x_i\hat{u}_i$, where $\hat{u}_i$ is the OLS residual, we know
from Section 4.4 that existence of fourth moments of the regressors needs to be
assumed.

Combining these results, we have that the GMM estimator is asymptotically
normally distributed with mean $\theta_0$ and estimated asymptotic variance

$$\hat{V}[\hat{\theta}_{GMM}] = \frac{1}{N}\left(\hat{G}'W_N\hat{G}\right)^{-1}\hat{G}'W_N\hat{S}W_N\hat{G}\left(\hat{G}'W_N\hat{G}\right)^{-1}. \tag{6.16}$$

This variance matrix estimator is a robust estimator that is an extension of the
Eicker-White heteroskedastic-consistent estimator for least-squares estimators.

One can also take expectations and use $\hat{G}_E = N^{-1}\sum_i E[\partial h_i/\partial\theta']|_{\hat{\theta}}$ for $G_0$ and
$\hat{S}_E = N^{-1}\sum_i E[h_i h_i']|_{\hat{\theta}}$ for $S_0$. However, this usually requires additional
distributional assumptions to take the expectation, and the variance matrix estimate will not
be as robust to distributional misspecification.

In the time-series case $h_t$ is subscripted by time t, and asymptotic theory is based
on the number of time periods $T \to \infty$. For time-series data, with $h_t$ a vector
MA(q) process, the usual estimator of $V[\hat{\theta}_{GMM}]$ is one proposed by Newey and
West (1987) that uses (6.16) with $\hat{S} = \hat{\Omega}_0 + \sum_{j=1}^{q}\left(1 - \frac{j}{q+1}\right)(\hat{\Omega}_j + \hat{\Omega}_j')$, where
$\hat{\Omega}_j = T^{-1}\sum_{t=j+1}^{T}\hat{h}_t\hat{h}_{t-j}'$. This permits time-series correlation in $h_t$ in addition to
contemporaneous correlation. Further details on covariance matrix estimation, including
improvements in the time-series case, are given in Davidson and MacKinnon (1993,
Section 17.5), Hamilton (1994), and Haan and Levin (1997).
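
To make (6.14) to (6.16) concrete, the sketch below evaluates them for the linear moment function $h_i = z_i(y_i - x_i'\beta)$, assuming independence over i. The function name and the simulated data are illustrative assumptions. As a check, in the just-identified case z = x the formula reduces to the Eicker-White heteroskedasticity-robust variance estimate for OLS.

```python
import numpy as np

def gmm_linear_variance(y, X, Z, W, beta):
    """Estimated variance (6.16) for the moment function h_i = z_i (y_i - x_i'beta)."""
    N = len(y)
    u = y - X @ beta
    G = -(Z.T @ X) / N                       # (6.14): average of dh_i/dbeta', r x q
    S = (Z * (u ** 2)[:, None]).T @ Z / N    # (6.15): average of h_i h_i', r x r
    A = np.linalg.inv(G.T @ W @ G)
    return A @ (G.T @ W @ S @ W @ G) @ A / N

# Simulated heteroskedastic regression (illustrative assumption).
rng = np.random.default_rng(3)
N = 400
X = np.column_stack([np.ones(N), rng.normal(size=N)])
u = rng.normal(size=N) * (1.0 + 0.5 * np.abs(X[:, 1]))
y = X @ np.array([1.0, 2.0]) + u

# Just-identified case: instruments equal regressors, so GMM is OLS.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
V_gmm = gmm_linear_variance(y, X, X, np.eye(2), beta_ols)

# Eicker-White estimate: (X'X)^{-1} [sum_i u_i^2 x_i x_i'] (X'X)^{-1}.
res = y - X @ beta_ols
XtX_inv = np.linalg.inv(X.T @ X)
V_white = XtX_inv @ ((X * (res ** 2)[:, None]).T @ X) @ XtX_inv
```

With r = q the weighting matrix cancels from (6.16), so any full rank choice would give the same `V_gmm`.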


6.3.5. Optimal Weighting Matrix 


Application of GMM requires specification of moment function h(·) and weighting
matrix $W_N$ in (6.7).

The easy part is choosing $W_N$ to obtain the GMM estimator with the smallest
asymptotic variance given a specified function h(·). This is often called optimal GMM
even though it is a limited form of optimality since a poor choice of h(·) could still lead
to a very inefficient estimator.

For just-identified models the same estimator (the MM estimator) is obtained for
any full rank weighting matrix, so one might just as well set $W_N = I_q$.

For overidentified models with r > q, and $S_0$ known, the most efficient GMM
estimator is obtained by choosing the weighting matrix $W_N = S_0^{-1}$. Then the variance
matrix given in the proposition simplifies and

$$\sqrt{N}(\hat{\theta}_{GMM} - \theta_0) \xrightarrow{d} N\left[0, (G_0'S_0^{-1}G_0)^{-1}\right], \tag{6.17}$$

a result due to Hansen (1982).

This result can be obtained using matrix arguments similar to those that establish
that GLS is the most efficient WLS estimator in the linear model. Even more simply,
one can work directly with the objective function. For LS estimators that minimize the
quadratic form $u'Wu$ the most efficient estimator is GLS that sets $W = \Sigma^{-1} = V[u]^{-1}$.
The GMM objective function in (6.7) is of this quadratic form with $u = N^{-1}\sum_i h_i(\theta)$
and so the optimal $W = (V[N^{-1}\sum_i h_i(\theta)])^{-1} = S_0^{-1}$. The optimal GMM estimator
weights by the inverse of the variance matrix of the sample moment conditions.


Optimal GMM 


In practice $S_0$ is unknown and we let $W_N = \hat{S}^{-1}$, where $\hat{S}$ is consistent for $S_0$. The
optimal GMM estimator can be obtained using a two-step procedure. At the first step
a GMM estimator is obtained using a suboptimal choice of $W_N$, such as $W_N = I_r$
for simplicity. From this first step, form estimate $\hat{S}$ using (6.15). At the second step
compute the GMM estimator with optimal weighting matrix $W_N = \hat{S}^{-1}$.

Then the optimal GMM estimator or two-step GMM estimator $\hat{\theta}_{OGMM}$ based on
$h_i(\theta)$ minimizes

$$Q_N(\theta) = \left[\frac{1}{N}\sum_{i=1}^{N} h_i(\theta)\right]' \hat{S}^{-1} \left[\frac{1}{N}\sum_{i=1}^{N} h_i(\theta)\right]. \tag{6.18}$$

The limit distribution is given in (6.17). The optimal GMM estimator is
asymptotically normally distributed with mean $\theta_0$ and estimated asymptotic variance with the
relatively simple formula

$$\hat{V}[\hat{\theta}_{OGMM}] = N^{-1}\left(\hat{G}'\tilde{S}^{-1}\hat{G}\right)^{-1}. \tag{6.19}$$

Usually evaluation of $\hat{G}$ and $\tilde{S}$ is at $\hat{\theta}_{OGMM}$, so $\tilde{S}$ uses the same formula as $\hat{S}$ except that
evaluation is at $\hat{\theta}_{OGMM}$. An alternative is to continue to evaluate (6.19) at the first-step
estimator, as any consistent estimate of $\theta_0$ can be used.

Remarkably, the optimal GMM estimator in (6.18) requires no additional stochastic
assumptions beyond those needed to permit use of (6.16) to estimate the variance
matrix of suboptimal GMM. In both cases $\hat{S}$ needs to be consistent for $S_0$ and from the
discussion after (6.15) this requires few additional assumptions. This stands in stark
contrast to the additional assumptions needed for GLS to be more efficient than OLS
when errors are heteroskedastic. Heteroskedasticity in the errors will affect the optimal
choice of $h_i(\theta)$, however (see Section 6.3.7).
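
The two-step procedure can be sketched for the linear IV moment function $h_i = z_i(y_i - x_i'\beta)$, for which each GMM step has a closed form. The simulated design, parameter values, and helper names below are illustrative assumptions. The first step uses $W_N = I$, and the variance estimate follows (6.19) with evaluation at the second-step estimate.

```python
import numpy as np

def linear_gmm(y, X, Z, W):
    """Closed-form minimizer of the GMM objective for h_i = z_i (y_i - x_i'b)."""
    XZ = X.T @ Z
    return np.linalg.solve(XZ @ W @ XZ.T, XZ @ W @ (Z.T @ y))

def moment_variance(y, X, Z, beta):
    """Estimate of S_0 as in (6.15): (1/N) sum_i u_i^2 z_i z_i'."""
    u = y - X @ beta
    return (Z * (u ** 2)[:, None]).T @ Z / len(y)

# Simulated overidentified model with heteroskedastic errors (illustrative assumption).
rng = np.random.default_rng(4)
N = 2000
Z = np.column_stack([np.ones(N), rng.normal(size=(N, 3))])
v = rng.normal(size=N)
u = (0.8 * v + rng.normal(size=N)) * (0.5 + np.abs(Z[:, 1]))
X = np.column_stack([np.ones(N), Z[:, 1:] @ np.array([1.0, 0.5, 0.5]) + v])
y = X @ np.array([1.0, 2.0]) + u

# Step 1: GMM with a suboptimal weighting matrix, here the identity.
beta_step1 = linear_gmm(y, X, Z, np.eye(Z.shape[1]))
S_hat = moment_variance(y, X, Z, beta_step1)

# Step 2: GMM with weighting matrix equal to the inverse of S_hat.
beta_ogmm = linear_gmm(y, X, Z, np.linalg.inv(S_hat))

# Estimated variance (6.19), with G and S evaluated at the second-step estimate.
G_hat = -(Z.T @ X) / N
S_tilde = moment_variance(y, X, Z, beta_ogmm)
V_ogmm = np.linalg.inv(G_hat.T @ np.linalg.inv(S_tilde) @ G_hat) / N
se_ogmm = np.sqrt(np.diag(V_ogmm))
```

No model for the heteroskedasticity is specified at any point; only the outer product of the estimated moment functions is used.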
Small-Sample Bias of Two-Step GMM


Theory suggests that for overidentified models it is best to use optimal GMM. In
implementation, however, the theoretical optimal weighting matrix $W_N = S_0^{-1}$ needs to be
replaced by a consistent estimate $\hat{S}^{-1}$. This replacement makes no difference
asymptotically, but it will make a difference in finite samples. In particular, individual
observations that increase $h_i(\theta)$ in (6.18) are likely to increase $\hat{S} = N^{-1}\sum_i \hat{h}_i\hat{h}_i'$ in (6.18),
leading to correlation between $N^{-1}\sum_i h_i(\theta)$ and $\hat{S}$. Note that $S_0 = \text{plim}\, N^{-1}\sum_i h_i h_i'$
is not similarly affected because the probability limit is taken.

Altonji and Segal (1996) demonstrated this problem in estimation of covariance 
structure models using panel data (see Section 22.5). They used the related minimum 
distance estimator (see Section 6.7) but in the literature their results are interpreted as
being relevant to GMM estimation with cross-section data or short panels. In simula- 
tions the optimal estimator was more efficient than a one-step estimator, as expected. 
However, the optimal estimator had finite-sample bias so large that its root mean- 
squared error was much larger than that for the one-step estimator. 

Altonji and Segal (1996) also proposed a variant, an independently weighted
optimal estimator that forms the weighting matrix using observations other than those used to
construct the sample moments. They split the sample into G groups, with G = 2 an
obvious choice, and minimize

$$Q_N(\theta) = \frac{1}{G}\sum_{g}\bar{h}_g(\theta)'\hat{S}_{(-g)}^{-1}\bar{h}_g(\theta), \tag{6.20}$$

where $\bar{h}_g(\theta)$ is computed for the gth group and $\hat{S}_{(-g)}$ is computed using all but the gth
group. This estimator is less biased, since the weighting matrix $\hat{S}_{(-g)}^{-1}$ is by construction
independent of $\bar{h}_g(\theta)$. However, splitting the sample leads to efficiency loss. Horowitz
(1998a) instead used the bootstrap (see Section 11.6.4).

In the Altonji and Segal (1996) example h; involves second moments, so S involves 
fourth moments. Finite-sample problems for the optimal estimator may not be as sig- 
nificant in other examples where h; involves only first moments. Nonetheless, Altonji 
and Segal’s results do suggest caution in using optimal GMM and that differences 
between one-step GMM and optimal GMM estimates may indicate problems of finite- 
sample bias in optimal GMM. 


Number of Moment Restrictions 


In general adding further moment restrictions improves asymptotic efficiency, as it
reduces the limit variance $(G_0'S_0^{-1}G_0)^{-1}$ of the optimal GMM estimator or at worst
leaves it unchanged.

The benefits of adding further moment conditions vary with the application. For ex- 
ample, if the estimator is the MLE then there is no gain since the MLE is already fully 
efficient. The literature has focused on IV estimation where gains may be considerable 
because the variable being instrumented may be much more highly correlated with a 
combination of many instruments than with a single instrument. 

There is a limit, however, as the number of moment restrictions cannot exceed 
the number of observations. Moreover, adding more moment conditions increases the 
likelihood of finite-sample bias and related problems similar to those of weak
instruments in linear models (see Section 4.9). Stock et al. (2002) briefly consider weak
instruments in nonlinear models.


6.3.6. Regression with Symmetric Error Example 


To demonstrate the GMM asymptotic results we return to the additional moment
restrictions example introduced in Section 6.2.4. For this example the objective function
for $\hat{\beta}_{GMM}$ has already been given in (6.2). All that is required is specification of $W_N$,
such as $W_N = I$.

To obtain the distribution of this estimator we use the general notation of Section
6.3. The function h(·) in (6.5) specializes to

$$h(y, x, \beta) = \begin{bmatrix} x(y - x'\beta) \\ x(y - x'\beta)^3 \end{bmatrix} \quad\Rightarrow\quad \frac{\partial h(y, x, \beta)}{\partial\beta'} = \begin{bmatrix} -xx' \\ -3xx'(y - x'\beta)^2 \end{bmatrix}.$$

These expressions lead directly to expressions for $G_0$ and $S_0$ using (6.9) and (6.12), so
that (6.14) and (6.15) then yield consistent estimates

$$\hat{G} = \begin{bmatrix} -\frac{1}{N}\sum_i x_i x_i' \\ -\frac{1}{N}\sum_i 3\hat{u}_i^2 x_i x_i' \end{bmatrix} \tag{6.21}$$

and

$$\hat{S} = \begin{bmatrix} \frac{1}{N}\sum_i \hat{u}_i^2 x_i x_i' & \frac{1}{N}\sum_i \hat{u}_i^4 x_i x_i' \\ \frac{1}{N}\sum_i \hat{u}_i^4 x_i x_i' & \frac{1}{N}\sum_i \hat{u}_i^6 x_i x_i' \end{bmatrix}, \tag{6.22}$$

where $\hat{u}_i = y_i - x_i'\hat{\beta}$. Alternative estimates can be obtained by first evaluating the
expectations in $G_0$ and $S_0$, but this will require assumptions on $E[u^2|x]$, $E[u^4|x]$, and
$E[u^6|x]$. Substituting $\hat{G}$, $\hat{S}$, and $W_N$ into (6.16) gives the estimated asymptotic
variance matrix for $\hat{\beta}_{GMM}$.

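Now consider GMM with an optimal weighting matrix. This again minimizes (6.2),
but from (6.18) now $W_N = \hat{S}^{-1}$, where $\hat{S}$ is defined in (6.22). Computation of $\hat{S}$
requires first-step consistent estimates $\hat{\beta}$. An obvious choice is GMM with $W_N = I$.
In this example the OLS estimator is also consistent and could instead be used.
Using (6.19) gives this two-step estimator an estimated asymptotic variance matrix
$\hat{V}[\hat{\beta}_{OGMM}]$ equal to

$$\left\{ \begin{bmatrix} \sum_i x_i x_i' \\ \sum_i 3\hat{u}_i^2 x_i x_i' \end{bmatrix}' \begin{bmatrix} \sum_i \hat{u}_i^2 x_i x_i' & \sum_i \hat{u}_i^4 x_i x_i' \\ \sum_i \hat{u}_i^4 x_i x_i' & \sum_i \hat{u}_i^6 x_i x_i' \end{bmatrix}^{-1} \begin{bmatrix} \sum_i x_i x_i' \\ \sum_i 3\hat{u}_i^2 x_i x_i' \end{bmatrix} \right\}^{-1},$$

where $\hat{u}_i = y_i - x_i'\hat{\beta}_{OGMM}$ and the various divisions by N have canceled out.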

Analytical results for the efficiency gain of optimal GMM in this example are easily
obtained by specialization to the nonregression case where $y$ is iid with mean $\mu$.
Furthermore, assume that $y$ is Laplace distributed with scale parameter equal to unity,
in which case the density is $f(y) = (1/2) \exp\{-|y - \mu|\}$ with $E[y] = \mu$, $V[y] = 2$,
and higher central moments $E[(y - \mu)^r]$ equal to zero for $r$ odd and equal to $r!$ for
$r$ even. The sample median is fully efficient as it is the MLE, and it can be shown to
have asymptotic variance $1/N$. The sample mean $\bar{y}$ is inefficient with variance
$V[\bar{y}] = V[y]/N = 2/N$. The optimal GMM estimator $\hat{\mu}_{OGMM}$ based on the two moment
conditions $E[(y - \mu)] = 0$ and $E[(y - \mu)^3] = 0$ has a weighting matrix that places much less
weight on the second moment condition, because it has relatively high variance, and
has negative off-diagonal entries. The optimal GMM estimator $\hat{\mu}_{OGMM}$ can be shown
to have asymptotic variance $1.7143/N$ (see Exercise 6.3). It is therefore more efficient
than the sample mean (variance $2/N$), though it is still considerably less efficient than
the sample median.

For this example the identity matrix is an exceptionally poor choice of weighting
matrix. It places too much weight on the second moment condition, yielding a
suboptimal GMM estimator of $\mu$ with asymptotic variance $19.14/N$ that is many times
greater than even $V[\bar{y}] = 2/N$. For details see Exercise 6.3.
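
The two variance figures quoted for the Laplace example can be checked numerically. The sketch below is not the solution to Exercise 6.3; it simply plugs the stated Laplace central moments into the sandwich form $(G_0'WG_0)^{-1}G_0'WS_0WG_0(G_0'WG_0)^{-1}$, assuming moment functions $h = [(y-\mu), (y-\mu)^3]'$. The function and variable names are illustrative only.

```python
import numpy as np
from math import factorial


def laplace_central_moment(r):
    # Laplace with unit scale: zero for odd r, r! for even r (as stated in the text).
    return 0.0 if r % 2 else float(factorial(r))


m2, m4, m6 = (laplace_central_moment(r) for r in (2, 4, 6))

# Moment functions h = [(y - mu), (y - mu)^3]'.
G0 = np.array([[-1.0], [-3.0 * m2]])   # E[dh/dmu]
S0 = np.array([[m2, m4], [m4, m6]])    # E[h h']


def asymptotic_variance(W):
    # Sandwich form; divide the result by N for the variance of the estimator.
    bread = np.linalg.inv(G0.T @ W @ G0)
    return (bread @ G0.T @ W @ S0 @ W @ G0 @ bread).item()


W_opt = np.linalg.inv(S0)              # optimal weighting matrix
avar_optimal = asymptotic_variance(W_opt)
avar_identity = asymptotic_variance(np.eye(2))
print(f"optimal W: {avar_optimal:.4f}/N, identity W: {avar_identity:.3f}/N")
```

Under these assumptions the optimal weighting matrix gives 12/7, about 1.7143, and the identity matrix gives 26210/1369, about 19.145, in line with the figures quoted above. The inverse of $S_0$ also shows the negative off-diagonal entries and the small weight on the third-moment condition.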


6.3.7. Optimal Moment Condition 


Section 6.3.5 gives the surprising result that optimal GMM requires essentially no
more assumptions than does GMM without an optimal weighting matrix. However,
this optimality is very limited as it is conditional on the choice of moment function
$h(\cdot)$ in (6.5) or (6.18).

The GMM defines a class of estimators, with different choices of $h(\cdot)$ corresponding
to different members of the class. Some choices of $h(\cdot)$ are better than others,
depending on additional stochastic assumptions. For example, $h_i = x_i u_i$ yields the OLS
estimator whereas $h_i = x_i u_i / V[u_i|x_i]$ yields the GLS estimator when errors are
heteroskedastic. This multitude of potential choices for $h(\cdot)$ can make any particular
GMM estimator appear ad hoc. However, qualitatively similar decisions have to be
made in m-estimation in choosing, for example, to minimize the sum of squared errors
rather than the weighted sum of squared errors or the sum of absolute deviations of
errors.

If complete distributional assumptions are made the most efficient estimator is the
MLE. Thus the optimal choice of $h(\cdot)$ in (6.5) is

$$h(w, \theta) = \frac{\partial \ln f(w, \theta)}{\partial \theta},$$

where $f(w, \theta)$ is the joint density of $w$. For regression with dependent variable(s) $y$
and regressors $x$ this is the unconditional MLE based on the unconditional joint
density $f(y, x, \theta)$ of $y$ and $x$. In many applications $f(y, x, \theta) = f(y|x, \theta)g(x)$, where the
(suppressed) parameters of the marginal density of $x$ do not depend on the parameters
of interest $\theta$. Then it is just as efficient to use the conditional MLE based on the
conditional density $f(y|x, \theta)$. This can be used as the basis for MM estimation, or GMM
estimation with weighting matrix $W_N = I_q$, though any full-rank matrix $W_N$ will also
give the MLE. This result is of limited practical use, however, as the purpose of GMM
estimation is to avoid making a full set of distributional assumptions.

When incomplete distributional assumptions are made, a common starting point is
specification of a conditional moment condition, where conditioning is on exogenous
variables. This is usually a low-order moment condition for the model error such
as $E[u|x] = 0$ or $E[u|z] = 0$. This conditional moment condition can lead to many
unconditional moment conditions that might be the basis for GMM estimation, such
as $E[zu] = 0$. Newey (1990a, 1993) obtained results on the optimal choice of
unconditional moment condition for data independent over $i$.

Specifically, begin with $s$ conditional moment condition restrictions

$$E[r(y, x, \theta_0)|z] = 0, \qquad (6.23)$$

where $r(\cdot)$ is a residual-type $s \times 1$ vector function introduced in Section 6.2.2. A scalar
example is $E[y - x'\theta_0|z] = 0$. The instrumental variables notation is being used where
$x$ are regressors, some potentially endogenous, and $z$ are instruments that include the
exogenous components of $x$. In simpler models without endogeneity $z = x$.

GMM estimation of the $q$ parameters $\theta$ based on (6.23) is not possible, as typically
there are only a few conditional moment restrictions, and often just one, so $s < q$.
Instead, we introduce an $r \times s$ matrix function of the instruments $D(z)$, where $r \geq q$,
and note that by the law of iterated expectations $E[D(z)r(y, x, \theta_0)] = 0$, which can be
used as the basis for GMM estimation. The optimal instruments or optimal choice of
matrix function $D(z)$ can be shown to be the $q \times s$ matrix

$$D^*(z, \theta_0) = E\left[ \left. \frac{\partial r(y, x, \theta_0)'}{\partial \theta} \right| z \right] \{ V[r(y, x, \theta_0)|z] \}^{-1}. \qquad (6.24)$$

A derivation is given in, for example, Davidson and MacKinnon (1993, p. 604). The
optimal instrument matrix $D^*(z)$ is a $q \times s$ matrix, so the unconditional moment
condition $E[D^*(z)r(y, x, \theta_0)] = 0$ yields exactly as many moment conditions as
parameters. The optimal GMM estimator simply solves the corresponding sample moment
conditions

$$\frac{1}{N} \sum_{i=1}^{N} D^*(z_i, \hat{\theta})\, r(y_i, x_i, \hat{\theta}) = 0. \qquad (6.25)$$

The optimal estimator requires additional assumptions, namely the expectations
used in forming $D^*(z, \theta_0)$ in (6.24), and implementation requires replacing unknown
parameters by estimates so that generated regressors $\hat{D}$ are used.

For example, if $r(y, x, \theta) = y - \exp(x'\theta)$ then $\partial r/\partial\theta = -\exp(x'\theta)x$ and (6.24)
requires specification of $E[\exp(x'\theta_0)x|z]$ and $V[y - \exp(x'\theta_0)|z]$. One possibility is
to assume $E[\exp(x'\theta_0)x|z]$ is a low-order polynomial in $z$, in which case there will
be more moment conditions than parameters and so estimation is by GMM rather
than simply by solving (6.25), and to assume errors are homoskedastic. If these
additional assumptions are wrong then the estimator is still consistent, provided (6.23) is
valid, and consistent standard errors can be obtained using the robust form of the
variance matrix in (6.16). It is common to more simply use $z$ rather than $D^*(z, \theta)$ as the
instrument.


Optimal Moment Condition for Nonlinear Regression Example 


The result (6.24) is useful in some cases, especially those where $z = x$. Here we
confirm that GLS is the most efficient GMM estimator based on $E[u|x] = 0$.

Consider the nonlinear regression model $y = g(x, \beta) + u$. If the starting point is
the conditional moment restriction $E[u|x] = 0$, or $E[y - g(x, \beta)|x] = 0$, then $z = x$ in
(6.23), and (6.24) yields

$$D^*(x, \beta) = E\left[ \left. \frac{\partial (y - g(x, \beta_0))}{\partial \beta} \right| x \right] \{ V[y - g(x, \beta_0)|x] \}^{-1} = -\frac{\partial g(x, \beta_0)}{\partial \beta} \times \frac{1}{V[u|x]},$$

which requires only specification of $V[u|x]$. From (6.25) the optimal GMM estimator
directly solves the corresponding sample moment conditions

$$\frac{1}{N} \sum_{i=1}^{N} \frac{\partial g(x_i, \beta)}{\partial \beta} \times \frac{(y_i - g(x_i, \beta))}{\sigma_i^2} = 0,$$

where $\sigma_i^2 = V[u_i|x_i]$ is functionally independent of $\beta$. These are the first-order
conditions for generalized NLS when the error is heteroskedastic. Implementation is
possible using a consistent estimate $\hat{\sigma}_i^2$ of $\sigma_i^2$, in which case GMM estimation is the same
as FGNLS. One can obtain standard errors robust to misspecification of $\sigma_i^2$ as detailed
in Section 5.8.

Specializing to the linear model, $g(x, \beta) = x'\beta$ and the optimal GMM estimator
based on $E[u|x] = 0$ is GLS, and specializing further to the case of homoskedastic
errors, the optimal GMM estimator based on $E[u|x] = 0$ is OLS. As already seen in
the example in Section 6.3.6, more efficient estimation may be possible if additional
conditional moment conditions are used.


6.3.8. Tests of Overidentifying Restrictions 


Hypothesis tests on $\theta$ can be performed using the Wald test (see Section 5.5), or with
other methods given in Section 7.5.

In addition there is a quite general model specification test that can be used for
overidentified models with more moment conditions ($r$) than parameters ($q$). The test is one
of the closeness of $N^{-1} \sum_i \hat{h}_i$ to 0, where $\hat{h}_i = h(w_i, \hat{\theta})$. This is an obvious test of $H_0$:
$E[h(w, \theta_0)] = 0$, the initial population moment conditions. For just-identified models,
estimation imposes $N^{-1} \sum_i \hat{h}_i = 0$ and the test is not possible. For overidentified
models, however, the first-order conditions (6.8) set a $q \times r$ matrix times $N^{-1} \sum_i \hat{h}_i$
to zero, where $q < r$, so $\sum_i \hat{h}_i \neq 0$.

In the special case that $\theta$ is estimated by $\hat{\theta}_{OGMM}$ defined in (6.18), Hansen (1982)
showed that the overidentifying restrictions (OIR) test statistic

$$OIR = \left( N^{-1/2} \sum_i \hat{h}_i \right)' \hat{S}^{-1} \left( N^{-1/2} \sum_i \hat{h}_i \right) \qquad (6.26)$$

is asymptotically distributed as $\chi^2(r - q)$ under $H_0$: $E[h(w, \theta_0)] = 0$. Note that OIR
equals $N$ times the GMM objective function (6.18) evaluated at $\hat{\theta}_{OGMM}$. If OIR is large then
the population moment conditions are rejected and the GMM estimator is inconsistent
for $\theta$.

It is not obvious a priori that the particular quadratic form in $N^{-1/2} \sum_i \hat{h}_i$ given in
(6.26) is $\chi^2(r - q)$ distributed under $H_0$. A formal derivation is given in the next
section and an intuitive explanation in the case of linear IV estimation is provided
in Section 8.4.4.

A classic application is to life-cycle models of consumption (see Section 6.2.7), in 
which case the orthogonality conditions are Euler conditions. A large chi-square test 
statistic is then often stated to mean rejection of the life-cycle hypothesis. However, it 
should instead be more narrowly interpreted as rejection of the particular specification 
of utility function and set of stochastic assumptions used in the study. 


6.3.9. Derivations for the GMM Estimator 


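The algebra is simplified by introducing a more compact notation. The GMM estimator
minimizes

$$Q_N(\theta) = g_N(\theta)' W_N g_N(\theta), \qquad (6.27)$$

where $g_N(\theta) = N^{-1} \sum_i h_i(\theta)$. Then the GMM first-order conditions (6.8) are

$$G_N(\hat{\theta})' W_N g_N(\hat{\theta}) = 0, \qquad (6.28)$$

where $G_N(\theta) = \partial g_N(\theta)/\partial \theta' = N^{-1} \sum_i \partial h_i(\theta)/\partial \theta'$.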

For consistency we consider the informal condition that the probability limit of
$\partial Q_N(\theta)/\partial \theta |_{\theta_0}$ equals zero. From (6.28) this will be the case as $G_N(\theta_0)$ and $W_N$
have finite probability limits, by assumptions (iii) and (iv) of Proposition 6.1, and
plim $g_N(\theta_0) = 0$ as a consequence of assumption (v). More intuitively, $g_N(\theta_0) =
N^{-1} \sum_i h_i(\theta_0)$ has probability limit zero if a law of large numbers can be applied
and $E[h_i(\theta_0)] = 0$, which was assumed at the outset in (6.5).

The parameter $\theta_0$ is identified by the key assumption (ii) and additionally
assumptions (iii) and (iv), which restrict the probability limits of $G_N(\theta_0)$ and $W_N$ to be
full-rank matrices. The assumption that $G_0 = \text{plim}\, G_N(\theta_0)$ is a full-rank matrix is called
the rank condition for identification. A weaker necessary condition for identification
is the order condition that $r \geq q$.

For asymptotic normality, a more general theory is needed than that for an
m-estimator based on an objective function $Q_N(\theta) = N^{-1} \sum_i q(w_i, \theta)$ that involves just
one sum. We rescale (6.28) by multiplication by $\sqrt{N}$, so that

$$G_N(\hat{\theta})' W_N \sqrt{N} g_N(\hat{\theta}) = 0. \qquad (6.29)$$


The approach of the general Theorem 5.3 is to take a Taylor series expansion around
$\theta_0$ of the entire left-hand side of (6.28). Since $\hat{\theta}$ appears in both the first and third
terms this is complicated and requires existence of first derivatives of $G_N(\theta)$ and hence
second derivatives of $g_N(\theta)$. Since $G_N(\hat{\theta})$ and $W_N$ have finite probability limits it is
sufficient to more simply take an exact Taylor series expansion of only $\sqrt{N} g_N(\hat{\theta})$. This
yields an expression similar to that in the Chapter 5 discussion of m-estimation, with

$$\sqrt{N} g_N(\hat{\theta}) = \sqrt{N} g_N(\theta_0) + G_N(\theta^{+}) \sqrt{N}(\hat{\theta} - \theta_0), \qquad (6.30)$$

recalling that $G_N(\theta) = \partial g_N(\theta)/\partial \theta'$, where $\theta^{+}$ is a point between $\theta_0$ and $\hat{\theta}$.
Substituting (6.30) back into (6.29) yields

$$G_N(\hat{\theta})' W_N \left[ \sqrt{N} g_N(\theta_0) + G_N(\theta^{+}) \sqrt{N}(\hat{\theta} - \theta_0) \right] = 0.$$

Solving for $\sqrt{N}(\hat{\theta} - \theta_0)$ yields

$$\sqrt{N}(\hat{\theta} - \theta_0) = -\left[ G_N(\hat{\theta})' W_N G_N(\theta^{+}) \right]^{-1} G_N(\hat{\theta})' W_N \sqrt{N} g_N(\theta_0). \qquad (6.31)$$

Equation (6.31) is the key result for obtaining the limit distribution of the GMM
estimator. We obtain the probability limits of each of the first five terms using $\hat{\theta} \to^p \theta_0$,
given consistency, in which case $\theta^{+} \to^p \theta_0$. The last term on the right-hand side of
(6.31) has a limit normal distribution by assumption (v). Thus

$$\sqrt{N}(\hat{\theta} - \theta_0) \to^d -(G_0' W_0 G_0)^{-1} G_0' W_0 \times N[0, S_0],$$

where $G_0$, $W_0$, and $S_0$ have been defined in Proposition 6.1. Applying the limit normal
product rule (Theorem A.17) yields (6.11).

This derivation treats the GMM first-order conditions as being $q$ linear combinations
of the $r$ sample moments $g_N(\hat{\theta})$, since $G_N(\hat{\theta})' W_N$ is a $q \times r$ matrix. The MM
estimator is the special case $q = r$, since then $G_N(\hat{\theta})' W_N$ is a full-rank square matrix,
so $G_N(\hat{\theta})' W_N g_N(\hat{\theta}) = 0$ implies that $g_N(\hat{\theta}) = 0$.

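To derive the distribution of the OIR test statistic in (6.26), begin with a first-order
Taylor series expansion of $\sqrt{N} g_N(\hat{\theta})$ around $\theta_0$ to obtain

$$\begin{aligned}
\sqrt{N} g_N(\hat{\theta}_{OGMM}) &= \sqrt{N} g_N(\theta_0) + G_N(\theta^{+}) \sqrt{N}(\hat{\theta}_{OGMM} - \theta_0) \\
&= \sqrt{N} g_N(\theta_0) - G_0 (G_0' S_0^{-1} G_0)^{-1} G_0' S_0^{-1} \sqrt{N} g_N(\theta_0) + o_p(1) \\
&= [I - M_0 S_0^{-1}] \sqrt{N} g_N(\theta_0) + o_p(1),
\end{aligned}$$

where the second equality uses (6.31) with $W_N$ consistent for $S_0^{-1}$, $M_0 =
G_0 (G_0' S_0^{-1} G_0)^{-1} G_0'$, and $o_p(1)$ is defined in Definition A.22. It follows that

$$\begin{aligned}
S_0^{-1/2} \sqrt{N} g_N(\hat{\theta}_{OGMM}) &= S_0^{-1/2} [I - M_0 S_0^{-1}] \sqrt{N} g_N(\theta_0) + o_p(1) \qquad (6.32) \\
&= [I - S_0^{-1/2} M_0 S_0^{-1/2}] S_0^{-1/2} \sqrt{N} g_N(\theta_0) + o_p(1).
\end{aligned}$$

Now $[I - S_0^{-1/2} M_0 S_0^{-1/2}] = [I - S_0^{-1/2} G_0 (G_0' S_0^{-1} G_0)^{-1} G_0' S_0^{-1/2}]$ is an idempotent
matrix of rank $(r - q)$, and $S_0^{-1/2} \sqrt{N} g_N(\theta_0) \to^d N[0, I]$ given $\sqrt{N} g_N(\theta_0) \to^d
N[0, S_0]$. From standard results for quadratic forms of normal variables it follows
that the inner product

$$\tau_N = \left( S_0^{-1/2} \sqrt{N} g_N(\hat{\theta}_{OGMM}) \right)' \left( S_0^{-1/2} \sqrt{N} g_N(\hat{\theta}_{OGMM}) \right)$$

converges to the $\chi^2(r - q)$ distribution.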
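
The key algebraic step, that $I - S_0^{-1/2} M_0 S_0^{-1/2}$ is idempotent with rank $r - q$, is easy to check numerically. The sketch below uses arbitrary randomly generated $G_0$ and $S_0$ (purely illustrative values, not taken from any model in the text) and computes the rank as the trace, which equals the rank for an idempotent matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
r, q = 5, 2                                   # moment conditions, parameters

G0 = rng.normal(size=(r, q))                  # arbitrary full-column-rank G_0
A = rng.normal(size=(r, r))
S0 = A @ A.T + r * np.eye(r)                  # arbitrary symmetric positive definite S_0

# Symmetric inverse square root of S_0 via its eigendecomposition.
vals, vecs = np.linalg.eigh(S0)
S_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T
S_inv = S_inv_half @ S_inv_half

M0 = G0 @ np.linalg.inv(G0.T @ S_inv @ G0) @ G0.T
P = np.eye(r) - S_inv_half @ M0 @ S_inv_half

is_idempotent = np.allclose(P @ P, P)
rank_P = int(round(np.trace(P)))              # trace = rank for idempotent matrices
print(is_idempotent, rank_P)
```

The degrees of freedom of the OIR test are exactly this rank: each estimated parameter uses up one of the $r$ linear combinations of the sample moments.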


6.4. Linear Instrumental Variables 


Correlation of regressors with the error term leads to inconsistency of least-
squares methods. Examples of such failure include omitted variables, simultaneity,
measurement error in the regressors, and sample selection bias. Instrumental variables
methods provide a general approach that can handle any of these problems, provided
suitable instruments exist.

Instrumental variables methods fall naturally into the GMM framework as a surplus 
of instruments leads to an excess of moment conditions that can be used for estimation. 
Many IV results are most easily obtained using the GMM framework. 

Linear IV is important enough to appear in many places in this book. An introduc- 
tion was given in Sections 4.8 and 4.9. This section presents single-equation linear IV 
as a particular application of GMM. For completeness the section also presents the 
earlier literature on a special case, the two-stage least-squares estimator. Systems lin- 
ear IV estimation is summarized in Section 6.9.5. Tests of endogeneity and tests of 
overidentifying restrictions for linear models are detailed in Section 8.4. Chapter 22 
presents linear IV estimation with panel data. 


6.4.1. Linear GMM with Instruments 
Consider the linear regression model

$$y_i = x_i'\beta + u_i, \qquad (6.33)$$

where each component of $x$ is viewed as being an exogenous regressor if it is
uncorrelated with the error in model (6.33) or an endogenous regressor if it is correlated.
If all regressors are exogenous then LS estimators can be used, but if any components
of $x$ are endogenous then LS estimators are inconsistent for $\beta$.

From Section 4.8, consistent estimates can be obtained by IV estimation. The key
assumption is the existence of an $r \times 1$ vector of instruments $z$ that satisfies

$$E[u_i|z_i] = 0. \qquad (6.34)$$

Exogenous regressors can be instrumented by themselves. As there must be at least as
many instruments as regressors, the challenge is to find additional instruments that at
least equal the number of endogenous variables in the model. Some examples of such
instruments have been given in Section 4.8.2.


Linear GMM Estimator 


From Section 6.2.5, the conditional moment restriction (6.34) and model (6.33) imply
the unconditional moment restriction

$$E[z_i(y_i - x_i'\beta)] = 0, \qquad (6.35)$$

where for notational simplicity the following analysis uses $\beta$ rather than the more
formal $\beta_0$ to denote the true parameter value. A quadratic form in the corresponding
sample moments leads to the GMM objective function $Q_N(\beta)$ given in (6.4).

In matrix notation define $y = X\beta + u$ as usual and let $Z$ denote the $N \times r$ matrix
of instruments with $i$th row $z_i'$. Then $\sum_i z_i(y_i - x_i'\beta) = Z'u$ and (6.4) becomes

$$Q_N(\beta) = \left[ \frac{1}{N}(y - X\beta)'Z \right] W_N \left[ \frac{1}{N} Z'(y - X\beta) \right], \qquad (6.36)$$

where $W_N$ is an $r \times r$ full-rank symmetric weighting matrix with leading examples
given at the end of this section. The first-order conditions

$$\frac{\partial Q_N(\beta)}{\partial \beta} = -2 \left[ \frac{1}{N} X'Z \right] W_N \left[ \frac{1}{N} Z'(y - X\beta) \right] = 0$$

can actually be solved for $\beta$ in this special case of GMM, leading to the GMM
estimator in the linear IV model

$$\hat{\beta}_{GMM} = [X'Z W_N Z'X]^{-1} X'Z W_N Z'y, \qquad (6.37)$$

where the divisions by $N$ have canceled out.
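
Because (6.37) is in closed form, it takes only a few lines to compute. The following is a minimal sketch, assuming a hypothetical simulated design with one endogenous regressor and two excluded instruments; the function name `linear_gmm` and all data-generating values are illustrative and not from the text.

```python
import numpy as np


def linear_gmm(y, X, Z, W):
    """Linear GMM estimator [X'Z W Z'X]^{-1} X'Z W Z'y, as in (6.37)."""
    XZ = X.T @ Z
    return np.linalg.solve(XZ @ W @ XZ.T, XZ @ W @ (Z.T @ y))


# Hypothetical simulated design (true coefficients 1 and 2).
rng = np.random.default_rng(42)
n = 5000
z = rng.normal(size=(n, 2))
v = rng.normal(size=n)
u = 0.8 * v + 0.6 * rng.normal(size=n)   # error correlated with v, hence with x
x = 1.0 + z[:, 0] + 0.5 * z[:, 1] + v
X = np.column_stack([np.ones(n), x])     # K = 2 regressors
Z = np.column_stack([np.ones(n), z])     # r = 3 instruments
y = X @ np.array([1.0, 2.0]) + u

b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
b_gmm_identity = linear_gmm(y, X, Z, np.eye(Z.shape[1]))
b_gmm_2sls = linear_gmm(y, X, Z, np.linalg.inv(Z.T @ Z / n))
print(b_ols, b_gmm_identity, b_gmm_2sls)
```

In this design OLS overstates the slope because the regressor is positively correlated with the error, while both GMM variants recover it up to sampling error. The two weighting matrices give different estimates only because the model is overidentified.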


Distribution of Linear GMM Estimator 


The general results of Section 6.3 can be used to derive the asymptotic distribution.
Alternatively, since an explicit solution for $\hat{\beta}_{GMM}$ exists the analysis for OLS given in
Section 4.4 can be adapted. Substituting $y = X\beta + u$ into (6.37) yields

$$\hat{\beta}_{GMM} = \beta + \left[ (N^{-1}X'Z) W_N (N^{-1}Z'X) \right]^{-1} (N^{-1}X'Z) W_N (N^{-1}Z'u). \qquad (6.38)$$

From the last term, consistency of the GMM estimator essentially requires that
plim $N^{-1}Z'u = 0$. Under pure random sampling this requires that (6.35) holds,
whereas under other common sampling schemes (see Section 24.3) the stronger
assumption (6.34) is needed.

Additionally, the rank condition for identification of $\beta$ that plim $N^{-1}Z'X$ is of
rank $K$ ensures that the inverse in the right-hand side exists, provided $W_N$ is of full
rank. A weaker order condition is that $r \geq K$.

The limit distribution is based on the expression for $\sqrt{N}(\hat{\beta}_{GMM} - \beta)$ obtained by
simple manipulation of (6.38). This yields an asymptotic normal distribution for $\hat{\beta}_{GMM}$
with mean $\beta$ and estimated asymptotic variance

$$\hat{V}[\hat{\beta}_{GMM}] = N [X'Z W_N Z'X]^{-1} [X'Z W_N \hat{S} W_N Z'X] [X'Z W_N Z'X]^{-1}, \qquad (6.39)$$

where $\hat{S}$ is a consistent estimate of

$$S = \lim \frac{1}{N} \sum_{i=1}^{N} E[u_i^2 z_i z_i'],$$

given the usual cross-section assumption of independence over $i$. The essential
additional assumption needed for (6.39) is that $N^{-1/2} Z'u \to^d N[0, S]$. Result (6.39) also
follows from Proposition 6.1 with $h(\cdot) = z(y - x'\beta)$ and hence $\partial h/\partial \beta' = -zx'$.

For cross-section data with heteroskedastic errors, $S$ is consistently estimated by

$$\hat{S} = \frac{1}{N} \sum_{i=1}^{N} \hat{u}_i^2 z_i z_i' = Z'DZ/N, \qquad (6.40)$$

where $\hat{u}_i = y_i - x_i'\hat{\beta}_{GMM}$ is the GMM residual and $D$ is an $N \times N$ diagonal matrix
with entries $\hat{u}_i^2$. A commonly used small-sample adjustment is to divide by $N - K$
rather than $N$ in the formula for $\hat{S}$. In the more restrictive case of homoskedastic errors,
$E[u_i^2|z_i] = \sigma^2$ and so $S = \lim N^{-1} \sum_i \sigma^2 E[z_i z_i']$, leading to estimate

$$\hat{S} = s^2 Z'Z/N, \qquad (6.41)$$

where $s^2 = (N - K)^{-1} \sum_{i=1}^{N} \hat{u}_i^2$ is consistent for $\sigma^2$. These results mimic similar
results for OLS presented in Section 4.4.5.


Table 6.2. GMM Estimators in Linear IV Model and Their Asymptotic Variance (a)

Estimator                          Definition and Asymptotic Variance

GMM                                $\hat{\beta}_{GMM} = [X'Z W_N Z'X]^{-1} X'Z W_N Z'y$
(general $W_N$)                    $\hat{V}[\hat{\beta}] = N [X'Z W_N Z'X]^{-1} [X'Z W_N \hat{S} W_N Z'X] [X'Z W_N Z'X]^{-1}$

Optimal GMM                        $\hat{\beta}_{OGMM} = [X'Z \hat{S}^{-1} Z'X]^{-1} X'Z \hat{S}^{-1} Z'y$
($W_N = \hat{S}^{-1}$)             $\hat{V}[\hat{\beta}] = N [X'Z \hat{S}^{-1} Z'X]^{-1}$

2SLS                               $\hat{\beta}_{2SLS} = [X'Z(Z'Z)^{-1}Z'X]^{-1} X'Z(Z'Z)^{-1}Z'y$
($W_N = [N^{-1}Z'Z]^{-1}$)         $\hat{V}[\hat{\beta}] = N [X'Z(Z'Z)^{-1}Z'X]^{-1} [X'Z(Z'Z)^{-1} \hat{S} (Z'Z)^{-1}Z'X] [X'Z(Z'Z)^{-1}Z'X]^{-1}$
                                   $\hat{V}[\hat{\beta}] = s^2 [X'Z(Z'Z)^{-1}Z'X]^{-1}$ if homoskedastic errors

IV                                 $\hat{\beta}_{IV} = [Z'X]^{-1} Z'y$
(just-identified)                  $\hat{V}[\hat{\beta}] = N (Z'X)^{-1} \hat{S} (X'Z)^{-1}$

(a) Equations are based on a linear regression model with dependent variable y, regressors X, and instruments
Z. $\hat{S}$ is defined in (6.40) and $s^2$ is defined after (6.41). All variance matrix estimates assume errors that are
independent across observations and heteroskedastic, aside from the simplification for homoskedastic errors
given for the 2SLS estimator. Optimal GMM uses the optimal weighting matrix.
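
The variance formulas (6.39) to (6.41) can be sketched in code as follows. This is an illustrative implementation under the same kind of hypothetical simulated design used for checking estimators (homoskedastic errors by construction); the function name `gmm_variance` is not from the text, and 2SLS weighting is used only as one convenient choice of $W_N$.

```python
import numpy as np


def gmm_variance(y, X, Z, W, b):
    """Heteroskedasticity-robust variance (6.39), with S_hat from (6.40)."""
    n = len(y)
    u_hat = y - X @ b
    S_hat = (Z * u_hat[:, None] ** 2).T @ Z / n      # N^{-1} sum u_i^2 z_i z_i'
    XZ = X.T @ Z
    bread = np.linalg.inv(XZ @ W @ XZ.T)
    return n * bread @ (XZ @ W @ S_hat @ W @ XZ.T) @ bread


# Hypothetical simulated data with homoskedastic errors.
rng = np.random.default_rng(123)
n = 5000
z = rng.normal(size=(n, 2))
v = rng.normal(size=n)
u = 0.8 * v + 0.6 * rng.normal(size=n)
x = 1.0 + z[:, 0] + 0.5 * z[:, 1] + v
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])
y = X @ np.array([1.0, 2.0]) + u

W = np.linalg.inv(Z.T @ Z / n)                       # 2SLS weighting matrix
XZ = X.T @ Z
b = np.linalg.solve(XZ @ W @ XZ.T, XZ @ W @ (Z.T @ y))

V_robust = gmm_variance(y, X, Z, W, b)
K = X.shape[1]
s2 = np.sum((y - X @ b) ** 2) / (n - K)              # s^2 defined after (6.41)
V_homosk = s2 * np.linalg.inv(XZ @ np.linalg.inv(Z.T @ Z) @ XZ.T)

se_robust = np.sqrt(np.diag(V_robust))
se_homosk = np.sqrt(np.diag(V_homosk))
print(se_robust, se_homosk)
```

Since the simulated errors are homoskedastic, the robust and simplified standard errors should be close; with heteroskedastic errors only the robust form remains valid.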


6.4.2. Different Linear GMM Estimators 


Implementation of the results of Section 6.4.1 requires specification of the weighting 
matrix Wy. For just-identified models all choices of Wy lead to the same estima- 
tor. For overidentified models there are two common choices of Wy, given in the 
following. 

Table 6.2 summarizes these estimators and gives the appropriate specialization of 
the estimated variance matrix formula given in (6.39), assuming independent het- 
eroskedastic errors. 


Instrumental Variables Estimator 


In the just-identified case $r = K$ and $X'Z$ is a square matrix that is invertible. Then
$[X'Z W_N Z'X]^{-1} = (Z'X)^{-1} W_N^{-1} (X'Z)^{-1}$ and (6.37) simplifies to the instrumental
variables estimator

$$\hat{\beta}_{IV} = (Z'X)^{-1} Z'y, \qquad (6.42)$$

introduced in Section 4.8.6. For just-identified models the GMM estimator for any
choice of $W_N$ equals the IV estimator.

The simple IV estimator can also be used in overidentified models, by discarding
some of the instruments so that the model is just-identified, but this results in an
efficiency loss compared to using all the instruments.


Optimal-Weighted GMM 


From Section 6.3.5, for overidentified models the most efficient GMM estimator,
meaning GMM with optimal choice of weighting matrix, sets $W_N = \hat{S}^{-1}$ in (6.37).
The optimal GMM estimator or two-step GMM estimator in the linear IV model
is

$$\hat{\beta}_{OGMM} = \left[ (X'Z) \hat{S}^{-1} (Z'X) \right]^{-1} (X'Z) \hat{S}^{-1} (Z'y). \qquad (6.43)$$

For heteroskedastic errors, $\hat{S}$ is computed using (6.40) based on a consistent first-step
estimate $\tilde{\beta}$ such as the 2SLS estimator defined in (6.44). White (1982) called this
estimator a two-stage IV estimator, since both steps entail IV estimation.

The estimated asymptotic variance matrix for optimal GMM given in Table 6.2
is of relatively simple form as (6.39) simplifies when $W_N = \hat{S}^{-1}$. In computing the
estimated variance one can use $\hat{S}$ as presented in Table 6.2, but it is more common to
instead use an estimator $\tilde{S}$, say, that is also computed using (6.40) but evaluates the
residual at the optimal GMM estimator rather than the first-step estimate used to form
$\hat{S}$ in (6.43).
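
The two-step procedure can be sketched as below, assuming hypothetical simulated data with heteroskedastic errors. The function name `two_step_gmm` is illustrative. The variance uses $\tilde{S}$ evaluated at the second-step estimate, and the last line computes the OIR statistic of Section 6.3.8 as $N$ times the minimized objective, using the weighting matrix actually employed in the second step.

```python
import numpy as np


def two_step_gmm(y, X, Z):
    n = len(y)
    XZ, Zy = X.T @ Z, Z.T @ y

    def estimate(W):
        return np.linalg.solve(XZ @ W @ XZ.T, XZ @ W @ Zy)

    def s_hat(b):
        u = y - X @ b
        return (Z * u[:, None] ** 2).T @ Z / n        # (6.40)

    b_2sls = estimate(np.linalg.inv(Z.T @ Z / n))     # first step: 2SLS
    W_opt = np.linalg.inv(s_hat(b_2sls))
    b_ogmm = estimate(W_opt)                          # second step: (6.43)
    S_tilde = s_hat(b_ogmm)                           # residuals at second-step estimate
    V = n * np.linalg.inv(XZ @ np.linalg.inv(S_tilde) @ XZ.T)
    g = Z.T @ (y - X @ b_ogmm) / n                    # sample moments
    oir = n * g @ W_opt @ g
    return b_2sls, b_ogmm, V, oir


# Hypothetical design: error variance depends on the first instrument.
rng = np.random.default_rng(3)
n = 5000
z = rng.normal(size=(n, 2))
v = rng.normal(size=n)
u = (0.8 * v + 0.6 * rng.normal(size=n)) * np.exp(0.5 * z[:, 0])
x = 1.0 + z[:, 0] + 0.5 * z[:, 1] + v
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])
y = X @ np.array([1.0, 2.0]) + u

b_2sls, b_ogmm, V_ogmm, oir = two_step_gmm(y, X, Z)
print(b_2sls, b_ogmm, np.sqrt(np.diag(V_ogmm)), oir)
```

With three instruments and two regressors there is one overidentifying restriction, so under valid instruments the statistic would be compared with a $\chi^2(1)$ critical value.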


Two-Stage Least Squares 


If errors are homoskedastic rather than heteroskedastic, $\hat{S}^{-1} = [s^2 N^{-1} Z'Z]^{-1}$ from
(6.41). Then $W_N = (N^{-1}Z'Z)^{-1}$ in (6.37), leading to the two-stage least-squares
estimator, introduced in Section 4.8.7, that can be expressed compactly as

$$\hat{\beta}_{2SLS} = [X'P_Z X]^{-1} [X'P_Z y], \qquad (6.44)$$

where $P_Z = Z(Z'Z)^{-1}Z'$. The basis of the term two-stage least-squares is presented
in the next section. The 2SLS estimator is also called the generalized instrumental
variables (GIV) estimator as it generalizes the IV estimator to the overidentified
case of more instruments than regressors. It is also called the one-step GMM because
(6.44) can be calculated in one step, whereas optimal GMM requires two steps.

The 2SLS estimator is asymptotically normal distributed with estimated asymptotic 
variance given in Table 6.2. The general form should be used if one wishes to guard 
against heteroskedastic errors whereas the simpler form, presented in many introduc- 
tory textbooks, is consistent only if errors are indeed homoskedastic. 


Optimal GMM versus 2SLS 


Both the optimal GMM and the 2SLS estimator lead to efficiency gains in overiden- 
tified models. Optimal GMM has the advantage of being more efficient than 2SLS, 
if errors are heteroskedastic, though the efficiency gain need not be great. Some of 
the GMM testing procedures given in Section 7.5 and Chapter 8 assume estimation
using the optimal weighting matrix. Optimal GMM has the disadvantage of requiring
additional computation compared to 2SLS. Moreover, as discussed in Section 6.3.5, 
asymptotic theory may provide a poor small-sample approximation to the distribution 
of the optimal GMM estimator. 

In cross-section applications it is common to use the less efficient 2SLS, though 
with inference based on heteroskedastic robust standard errors. 


Even More Efficient GMM Estimation 


The estimator $\hat{\beta}_{OGMM}$ is the most efficient estimator based on the unconditional
moment condition $E[z_i u_i] = 0$, where $u_i = y_i - x_i'\beta$. However, this is not the best moment
condition to use if the starting point is the conditional moment condition $E[u_i|z_i] = 0$
and errors are heteroskedastic, meaning $V[u_i|z_i]$ varies with $z_i$.

Applying the general results of Section 6.3.7, we can write the optimal moment
condition for GMM estimation based on $E[u_i|z_i] = 0$ as

$$E\left[ E[x_i|z_i]\, u_i / V[u_i|z_i] \right] = 0. \qquad (6.45)$$

As with the LS regression example in Section 6.3.7, one should divide by the error
variance $V[u|z]$. Implementation is more difficult than in the LS case, however, as
a model for $E[x|z]$ needs to be specified in addition to one for $V[u|z]$. This may be
possible with additional structure. In particular, for a linear simultaneous equations
system $E[x_i|z_i]$ is linear in $z$ so that estimation is based on $E[\hat{x}_i u_i / V[u_i|z_i]] = 0$,
where $\hat{x}_i$ denotes the linear prediction of $x_i$ from $z_i$.

For linear models the GMM estimator is usually based on the simpler condition
$E[z_i u_i] = 0$. Given this condition, the optimal GMM estimator defined in (6.43) is the
most efficient GMM estimator.


6.4.3. Alternative Derivations of Two-Stage Least Squares 


The 2SLS estimator, the standard IV estimator for overidentified models, was derived 
in Section 6.4.2 as a GMM estimator. 

Here we present three other derivations of the 2SLS estimator. One of these deriva- 
tions, due to Theil, provided the original motivation for 2SLS, which predates GMM. 
Theil’s interpretation is emphasized in introductory treatments. However, it does not 
generalize to nonlinear models, whereas the GMM interpretation does. 

We consider the linear model

$$y = X\beta + u, \qquad (6.46)$$

with $E[u|Z] = 0$ and additionally $V[u|Z] = \sigma^2 I$.


GLS in a Transformed Model

Premultiplication of (6.46) by the instruments $Z'$ yields the transformed model

$$Z'y = Z'X\beta + Z'u. \qquad (6.47)$$

This transformed model is often used as motivation for the IV estimator when $r = K$,
since ignoring $Z'u$, as $N^{-1}Z'u \to^p 0$, and solving yields $\hat{\beta} = (Z'X)^{-1}Z'y$.

Here instead we consider the overidentified case. Conditional on $Z$ the error $Z'u$ has
mean zero and variance $\sigma^2 Z'Z$ given the assumptions after (6.46). The efficient GLS
estimator of $\beta$ based on the transformed model (6.47) is then

$$\hat{\beta} = \left[ X'Z(\sigma^2 Z'Z)^{-1}Z'X \right]^{-1} X'Z(\sigma^2 Z'Z)^{-1}Z'y, \qquad (6.48)$$

which equals the 2SLS estimator in (6.44) since the multipliers $\sigma^2$ cancel out. More
generally, note that if the transformed model (6.47) is instead estimated by WLS with
weighting matrix $W_N$ then the more general estimator (6.37) is obtained.


Theil’s Interpretation 


Theil (1953) proposed estimation by OLS regression of the original model (6.46),
except that the regressors $X$ are replaced by a prediction $\hat{X}$ that is asymptotically
uncorrelated with the error term.

Suppose that in the reduced form model the regressors $X$ are a linear combination
of the instruments plus some error, so that

$$X = Z\Pi + v, \qquad (6.49)$$

where $\Pi$ is an $r \times K$ matrix. Multivariate OLS regression of $X$ on $Z$ yields estimator
$\hat{\Pi} = (Z'Z)^{-1}Z'X$ and OLS predictions $\hat{X} = Z\hat{\Pi}$ or

$$\hat{X} = P_Z X,$$

where $P_Z = Z(Z'Z)^{-1}Z'$. OLS regression of $y$ on $\hat{X}$ rather than $y$ on $X$ yields estimator

$$\hat{\beta}_{Theil} = (\hat{X}'\hat{X})^{-1} \hat{X}'y. \qquad (6.50)$$

Theil's interpretation permits computation by two OLS regressions, with the first-stage
OLS giving $\hat{X}$ and the second-stage OLS giving $\hat{\beta}$, leading to the term two-stage
least-squares estimator.

To establish consistency of this estimator reexpress the linear model (6.46) as

$$y = \hat{X}\beta + (X - \hat{X})\beta + u.$$

The second-stage OLS regression of $y$ on $\hat{X}$ yields a consistent estimator of $\beta$ if the
regressor $\hat{X}$ is asymptotically uncorrelated with the composite error term $(X - \hat{X})\beta + u$.
If $\hat{X}$ were any proxy variable there is no reason for this to hold; however, here $\hat{X}$ is
uncorrelated with $(X - \hat{X})$ as an OLS prediction is orthogonal to the OLS residual. Thus
plim $N^{-1}\hat{X}'(X - \hat{X}) = 0$. Also,

$$N^{-1}\hat{X}'u = N^{-1}X'P_Z u = N^{-1}X'Z (N^{-1}Z'Z)^{-1} N^{-1}Z'u.$$

Then $\hat{X}$ is asymptotically uncorrelated with $u$ provided $Z$ is a valid instrument so that
plim $N^{-1}Z'u = 0$. This consistency result for $\hat{\beta}_{Theil}$ depends heavily on the linearity
of the model and does not generalize to nonlinear models.

Theil's estimator in (6.50) equals the 2SLS estimator defined earlier in (6.44). We
have

$$\begin{aligned}
\hat{\beta}_{Theil} &= (\hat{X}'\hat{X})^{-1} \hat{X}'y \\
&= (X'P_Z'P_Z X)^{-1} X'P_Z'y \\
&= (X'P_Z X)^{-1} X'P_Z y,
\end{aligned}$$

the 2SLS estimator, using $P_Z'P_Z = P_Z$ in the final equality.

Care is needed in implementing 2SLS using Theil's method. The second-stage OLS
will give the wrong standard errors, even if errors are homoskedastic, as it will
estimate $\sigma^2$ using the second-stage OLS regression residuals $(y - \hat{X}\hat{\beta})$ rather than the
actual residuals $(y - X\hat{\beta})$. In practice one may also make adjustment for heteroskedastic
errors. It is much easier to use a program that offers 2SLS as an option and directly
computes (6.44) and the associated variance matrix given in Table 6.2.
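
The warning about second-stage residuals can be illustrated with a short sketch on hypothetical simulated data (all values illustrative). It computes the estimator by two OLS regressions and by the direct formula, then compares the error variance estimate from the second-stage residuals with the one from the actual residuals.

```python
import numpy as np

# Hypothetical simulated design: true error variance is 1.
rng = np.random.default_rng(7)
n = 5000
z = rng.normal(size=(n, 2))
v = rng.normal(size=n)
u = 0.8 * v + 0.6 * rng.normal(size=n)
x = 1.0 + z[:, 0] + 0.5 * z[:, 1] + v
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])
y = X @ np.array([1.0, 2.0]) + u
K = X.shape[1]

# First stage: regress X on Z, giving Pi_hat and predictions X_hat = P_Z X.
Pi_hat = np.linalg.lstsq(Z, X, rcond=None)[0]
X_hat = Z @ Pi_hat

# Second stage: OLS of y on X_hat, as in (6.50).
b_theil = np.linalg.lstsq(X_hat, y, rcond=None)[0]

# Direct formula (X'P_Z X)^{-1} X'P_Z y, using X'P_Z X = X'X_hat and X'P_Z y = X_hat'y.
b_2sls = np.linalg.solve(X.T @ X_hat, X_hat.T @ y)

# Error variance from second-stage residuals (wrong) and actual residuals (right).
s2_second_stage = np.sum((y - X_hat @ b_theil) ** 2) / (n - K)
s2_actual = np.sum((y - X @ b_theil) ** 2) / (n - K)
print(b_theil, b_2sls, s2_second_stage, s2_actual)
```

The two coefficient vectors coincide, but the second-stage residuals also contain $(X - \hat{X})\hat{\beta}$, so in this design the naive variance estimate is several times too large.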

The 2SLS interpretation does not always carry over to nonlinear models, as detailed 
in Section 6.5.4. The GMM interpretation does, and for this reason it is emphasized 
here more than Theil’s original derivation of linear 2SLS. 

Theil actually considered a model where only some of the regressors X are endoge- 
nous and the remaining are exogenous. The preceding analysis still applies, provided 
all the exogenous components of X are included in the instruments Z. Then the first- 
stage OLS regression of the exogenous regressors on the instruments fits perfectly and 
the predictions of the exogenous regressors equal their actual values. So in practice at 
the first-stage just the endogenous variables are regressed on the instruments, and the 
second-stage regression is of y on the exogenous regressors and the first-stage predic- 
tions of the endogenous regressors. 


Basmann’s Interpretation 


Basmann (1957) proposed using as instruments the OLS reduced form predictions
$\hat{X} = P_Z X$ for the simple IV estimator in the just-identified case, since there are then
exactly as many instruments $\hat{X}$ as regressors $X$. This yields

$$\hat{\beta}_{Basmann} = (\hat{X}'X)^{-1} \hat{X}'y. \qquad (6.51)$$

This is consistent since plim $N^{-1}\hat{X}'u = 0$, as already shown for Theil's estimator.
The estimator (6.51) actually equals the 2SLS estimator defined in (6.44), since
$\hat{X}' = X'P_Z$.

This IV approach will lead to correct standard errors and can be extended to
nonlinear settings.


6.4.4. Alternatives to Standard IV Estimators 


The IV-based optimal GMM and 2SLS estimators presented in Section 6.4.2 are the
standard estimators used when regressors are endogenous. Chernozhukov and Hansen
(2005) present an IV estimator for quantile regression.

Here we briefly discuss leading alternative estimators that have received renewed
interest given the poor finite-sample properties of 2SLS with weak instruments detailed
in Section 4.9. We focus on single-equation linear models. At this stage there is no
method that is relatively efficient yet has small bias in small samples.


Limited-Information Maximum Likelihood 


The limited-information maximum likelihood (LIML) estimator is obtained by 
joint ML estimation of the single equation (6.46) plus the reduced form for the en- 
dogenous regressors in the right-hand side of (6.46) assuming homoskedastic normal 
errors. For details see Greene (2003, p. 402) or Davidson and MacKinnon (1993, 
pp. 644-651). More generally the k class of estimators (see, for example, Greene, 
2003, p. 403) includes LIML, 2SLS, and OLS. 

The LIML estimator due to Anderson and Rubin (1949) predates the 2SLS esti- 
mator. Unlike 2SLS, the LIML estimator is invariant to the normalization used in a 
simultaneous equations system. Moreover, LIML and 2SLS are asymptotically equiv- 
alent given homoskedastic errors. Yet LIML is rarely used as it is more difficult to 
implement and harder to explain than 2SLS. Bekker (1994) presents small-sample re- 
sults for LIML and a generalization of LIML. See also Hahn and Hausman (2002). 


Split-Sample IV 


Begin with Basmann's interpretation of 2SLS as an IV estimator given in (6.51).
Substituting for $y$ from (6.46) yields

$$\hat{\beta} = \beta + (\hat{X}'X)^{-1} \hat{X}'u.$$

By assumption plim $N^{-1}Z'u = 0$ so plim $N^{-1}\hat{X}'u = 0$ and $\hat{\beta}$ is consistent. However,
correlation between $X$ and $u$, the reason for IV estimation, means that $\hat{X} = P_Z X$ is
correlated with $u$. Thus $E[\hat{X}'u] \neq 0$, which leads to bias in the IV estimator. This bias
arises from using $\hat{X} = Z\hat{\Pi}$ rather than $Z\Pi$ as the instrument.

An alternative is to instead use as instrument predictions $\tilde{X}$, which have the property
that $E[\tilde{X}'u] = 0$ in addition to plim $N^{-1}\tilde{X}'u = 0$, and use estimator

$$\tilde{\beta} = (\tilde{X}'X)^{-1} \tilde{X}'y.$$

Since $E[\tilde{X}'u] = 0$ does not imply $E[(\tilde{X}'X)^{-1}\tilde{X}'u] = 0$, this estimator will still be
biased, but the bias may be reduced.

Angrist and Krueger (1995) proposed obtaining such instruments by splitting the
sample into two subsamples $(y_1, X_1, Z_1)$ and $(y_2, X_2, Z_2)$. The first sample is used
to obtain estimate $\hat{\Pi}_1$ from regression of $X_1$ on $Z_1$. The second sample is used to
obtain the IV estimator where the instrument $\tilde{X}_2 = Z_2\hat{\Pi}_1$ uses $\hat{\Pi}_1$ obtained from
the separate first sample. Angrist and Krueger (1995) define the unbiased split-sample
IV estimator as

$$\hat{\beta}_{USSIV} = (\tilde{X}_2'X_2)^{-1}\tilde{X}_2'y_2.$$

The split-sample IV estimator $\hat{\beta}_{SSIV} = (\tilde{X}_2'\tilde{X}_2)^{-1}\tilde{X}_2'y_2$ is a variant based on
Theil's interpretation of 2SLS. These estimators have finite-sample bias toward zero,
unlike 2SLS, which is biased toward OLS. However, considerable efficiency loss occurs
because only half the sample is used at the final stage.
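
The two split-sample estimators can be sketched in a few lines of Python for the
simplest case of one regressor and one instrument, with no intercept. This is an
illustrative sketch under those assumptions, not code from Angrist and Krueger (1995);
the function names ols_slope and split_sample_iv are hypothetical, and the sample is
assumed to have been split already.

```python
def ols_slope(x, z):
    """No-intercept OLS slope from regressing x on z (scalar case)."""
    return sum(zi * xi for zi, xi in zip(z, x)) / sum(zi * zi for zi in z)


def split_sample_iv(x1, z1, y2, x2, z2):
    """Return (USSIV, SSIV) estimates for scalar x and scalar z.

    Sample 1 (x1, z1) is used only for the first stage; sample 2
    (y2, x2, z2) is used only for the final IV step.
    """
    pi1 = ols_slope(x1, z1)                # first-stage coefficient from sample 1
    x2_tilde = [pi1 * zi for zi in z2]     # instrument for sample 2
    num = sum(a * yi for a, yi in zip(x2_tilde, y2))
    ussiv = num / sum(a * xi for a, xi in zip(x2_tilde, x2))   # Basmann-type form
    ssiv = num / sum(a * a for a in x2_tilde)                  # Theil-type form
    return ussiv, ssiv
```

The two estimators share a numerator and differ only in whether the denominator uses
the actual regressor or the constructed instrument from the second subsample.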


Jackknife IV 


A more efficient variant of this estimator implements a similar procedure but generates 
instruments observation by observation. 

Let the subscript $(-i)$ denote the leave-one-out operation that drops the $i$th
observation. Then for the $i$th observation we obtain estimate $\hat{\Pi}_i$ from regression of
$X_{(-i)}$ on $Z_{(-i)}$ and use as instrument $\hat{x}_i' = z_i'\hat{\Pi}_i$. Repeating $N$ times gives an
instrument vector denoted $\hat{X}_{(-i)}$ with $i$th row $\hat{x}_i'$. This leads to the jackknife IV
estimator


$$\hat{\beta}_{JIV} = (\hat{X}_{(-i)}'X)^{-1}\hat{X}_{(-i)}'y.$$


This estimator was originally proposed by Phillips and Hale (1977). Angrist, 
Imbens and Krueger (1999) and Blomquist and Dahlberg (1999) called it a jackknife 
estimator since the jackknife (see Section 11.5.5) is a leave-one-out method for bias 
reduction. The computational burden of obtaining the $N$ jackknife predicted values $\hat{x}_i'$
is modest by use of the recursive formula given in Section 11.5.5. The Monte Carlo 
evidence given in the two recent papers is mixed, however, indicating a potential for 
bias reduction but also an increase in the variance. So the jackknife version may not be 
better than the conventional version in terms of mean-square error. The earlier paper 
by Phillips and Hale (1977) presents analytical results that the finite-sample bias of the 
JIV estimator is smaller than that of 2SLS only for appreciably overidentified models 
with r > 2(K + 1). See also Hahn, Hausman and Kuersteiner (2001). 
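
A minimal sketch of the leave-one-out construction, again for one regressor and one
instrument without an intercept, is given next. It is an illustration under those
assumptions and not the implementation of any of the cited papers; the function name
jackknife_iv is hypothetical. In this scalar case the leave-one-out first-stage
coefficient is obtained by subtracting observation i's contribution from the
full-sample sums, so no regression needs to be rerun N times.

```python
def jackknife_iv(y, x, z):
    """Jackknife IV for scalar x and scalar z (no intercept).

    Returns the estimate and the list of leave-one-out fitted values.
    """
    szx = sum(zi * xi for zi, xi in zip(z, x))
    szz = sum(zi * zi for zi in z)
    x_loo = []
    for xi, zi in zip(x, z):
        # First-stage slope computed without observation i.
        pi_i = (szx - zi * xi) / (szz - zi * zi)
        x_loo.append(zi * pi_i)
    num = sum(a * yi for a, yi in zip(x_loo, y))
    den = sum(a * xi for a, xi in zip(x_loo, x))
    return num / den, x_loo
```

Each fitted value uses a first-stage coefficient that does not depend on the
observation's own error, which is the source of the potential bias reduction
described above.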


Independently Weighted 2SLS 


A related method to split-sample IV is the independently weighted GMM estimator of
Altonji and Segal (1996) given in Section 6.3.5. Splitting the sample into $G$ groups
and specializing to linear IV yields the independently weighted IV estimator


$$\hat{\beta}_{IWIV} = \frac{1}{G}\sum_{g=1}^{G}\left[X_g'Z_g\hat{S}_{(-g)}^{-1}Z_g'X_g\right]^{-1}X_g'Z_g\hat{S}_{(-g)}^{-1}Z_g'y_g,$$


where $\hat{S}_{(-g)}$ is computed using $\hat{S}$ defined in (6.40) except that observations from the
$g$th group are excluded. In a panel application Ziliak (1997) found that the
independently weighted IV estimator performed much better than the unbiased split-sample
IV estimator.


6.5. Nonlinear Instrumental Variables 


Nonlinear IV methods, notably nonlinear 2SLS proposed by Amemiya (1974), permit
consistent estimates of nonlinear regression models in situations where the NLS
estimator is inconsistent because regressors are correlated with the error term. We
present these methods as a straightforward extension of the GMM approach for linear
models.

Unlike the linear case the estimators have no explicit formula, but the asymptotic 
distribution can be obtained as a special case of the Section 6.3 results. This section 
presents single-equation results, with systems results given in Section 6.10.4. A fun- 
damentally important result is that a natural extension of Theil’s 2SLS method for 
linear models to nonlinear models can lead to inconsistent parameter estimates (see 
Section 6.5.4). Instead, the GMM approach should be used. 

An alternative nonlinearity can arise when the model for the dependent variable is 
a linear model, but the reduced form for the endogenous regressor(s) is a nonlinear 
model owing to special features of the dependent variable. For example, the endoge- 
nous regressor may be a count or a binary outcome. In that case the linear methods 
of the previous section still apply. One approach is to ignore the special nature of the 
endogenous regressor and just do regular linear 2SLS or optimal GMM. Alternatively, 
obtain fitted values for the endogenous regressor by appropriate nonlinear regression, 
such as Poisson regression on all the instruments if the endogenous regressor is a count, 
and then do regular linear IV using this fitted value as the instrument for the count, fol- 
lowing Basmann’s approach. Both estimators are consistent, though they have different 
asymptotic distributions. The first simpler approach is the usual procedure. 


6.5.1. Nonlinear GMM with Instruments 


Consider the quite general nonlinear regression model where the error term may be
additive or nonadditive (see Section 6.2.2). Thus


$$u_i = r(y_i, x_i, \beta), \tag{6.52}$$

where the nonlinear model with additive error is the special case

$$u_i = y_i - g(x_i, \beta), \tag{6.53}$$


where $g(\cdot)$ is a specified function. The estimators given in Section 6.2.2 are
inconsistent if $E[u_i|x_i] \neq 0$.

Assume the existence of $r$ instruments $z$, where $r \geq K$, that satisfy


$$E[u_i|z_i] = 0. \tag{6.54}$$


This is the same conditional moment condition as in the linear case, except that
$u_i = r(y_i, x_i, \beta)$ rather than $u_i = y_i - x_i'\beta$.


Nonlinear GMM Estimator


By the law of iterated expectations, (6.54) leads to

$$E[z_i u_i] = 0. \tag{6.55}$$


The GMM estimator minimizes the quadratic form in the corresponding sample moment
condition.

In matrix notation let $u$ denote the $N \times 1$ error vector with $i$th entry $u_i$ given in
(6.52) and let $Z$ be an $N \times r$ matrix of instruments with $i$th row $z_i'$. Then
$\sum_i z_i u_i = Z'u$ and the GMM estimator in the nonlinear IV model $\hat{\beta}_{GMM}$ minimizes


$$Q_N(\beta) = \left[\frac{1}{N}u'Z\right]W_N\left[\frac{1}{N}Z'u\right], \tag{6.56}$$


where $W_N$ is an $r \times r$ weighting matrix. Unlike linear GMM, the first-order conditions
do not lead to a closed-form solution for $\hat{\beta}_{GMM}$.


Distribution of Nonlinear GMM Estimator


The GMM estimator is consistent for $\beta$ given (6.54) and asymptotically normally
distributed with estimated asymptotic variance

$$\hat{V}[\hat{\beta}_{GMM}] = N[\hat{D}'ZW_NZ'\hat{D}]^{-1}[\hat{D}'ZW_N\hat{S}W_NZ'\hat{D}][\hat{D}'ZW_NZ'\hat{D}]^{-1} \tag{6.57}$$


using the results from Section 6.3.3 with $h(\cdot) = zu$, where $\hat{S}$ is given in the following
and $\hat{D}$ is an $N \times K$ matrix of derivatives of the error term


$$\hat{D} = \left.\frac{\partial u}{\partial \beta'}\right|_{\hat{\beta}_{GMM}}. \tag{6.58}$$


With nonadditive errors, $\hat{D}$ has $i$th row $\partial r(y_i, x_i, \beta)/\partial\beta'|_{\hat{\beta}}$. With additive errors,
$\hat{D}$ has $i$th row $\partial g(x_i, \beta)/\partial\beta'|_{\hat{\beta}}$, ignoring the minus sign that cancels out in (6.57).

For independent heteroskedastic errors,


$$\hat{S} = N^{-1}\sum_i \hat{u}_i^2 z_i z_i', \tag{6.59}$$


similar to the linear case except now $\hat{u}_i = r(y_i, x_i, \hat{\beta})$ or $\hat{u}_i = y_i - g(x_i, \hat{\beta})$.

The asymptotic variance of the GMM estimator in the nonlinear model is therefore
the same as that in the linear case given in (6.39), with the change that the regressor
matrix $X$ is replaced by the derivative $\partial u/\partial\beta'|_{\hat{\beta}}$. This is exactly the same change as
observed in Section 5.8 in going from linear to nonlinear least squares. By analogy
with linear IV, the rank condition for identification is that plim $N^{-1}Z'\,\partial u/\partial\beta'|_{\beta_0}$ is
of rank $K$ and the weaker order condition is that $r \geq K$.


6.5.2. Different Nonlinear GMM Estimators


Two leading specializations of the GMM estimator, which differ in the choice of
weighting matrix, are optimal GMM that sets $W_N = \hat{S}^{-1}$ and nonlinear two-stage least
squares (NL2SLS) that sets $W_N = (Z'Z)^{-1}$. Table 6.3 summarizes these estimators
and their associated variance matrices, assuming independent heteroskedastic errors,
and gives results for general $W_N$ and results for nonlinear IV in the just-identified
model.


Table 6.3. GMM Estimators in Nonlinear IV Model and Their Asymptotic Variance^a


Estimator                      Definition and Asymptotic Variance

GMM                            $Q_{GMM}(\beta) = u'ZW_NZ'u$
(general $W_N$)                $\hat{V}[\hat{\beta}] = N[\hat{D}'ZW_NZ'\hat{D}]^{-1}[\hat{D}'ZW_N\hat{S}W_NZ'\hat{D}][\hat{D}'ZW_NZ'\hat{D}]^{-1}$

Optimal GMM                    $Q_{OGMM}(\beta) = u'Z\hat{S}^{-1}Z'u$
($W_N = \hat{S}^{-1}$)               $\hat{V}[\hat{\beta}] = N[\hat{D}'Z\hat{S}^{-1}Z'\hat{D}]^{-1}$

NL2SLS                         $Q_{NL2SLS}(\beta) = u'Z(Z'Z)^{-1}Z'u$
($W_N = [N^{-1}Z'Z]^{-1}$)     $\hat{V}[\hat{\beta}] = N[\hat{D}'Z(Z'Z)^{-1}Z'\hat{D}]^{-1}[\hat{D}'Z(Z'Z)^{-1}\hat{S}(Z'Z)^{-1}Z'\hat{D}]$
                                              $\times\,[\hat{D}'Z(Z'Z)^{-1}Z'\hat{D}]^{-1}$
                               $\hat{V}[\hat{\beta}] = s^2[\hat{D}'Z(Z'Z)^{-1}Z'\hat{D}]^{-1}$ if homoskedastic errors

NLIV                           $\hat{\beta}_{NLIV}$ solves $Z'u = 0$
(just-identified)              $\hat{V}[\hat{\beta}] = N[Z'\hat{D}]^{-1}\hat{S}[\hat{D}'Z]^{-1}$


^a Equations are for a nonlinear regression model with error $u$ defined in (6.53) or (6.52) and instruments $Z$.
$\hat{D}$ is the derivative of the error vector with respect to $\beta'$ evaluated at $\hat{\beta}$ and simplifies for models with
additive error to the derivative of the conditional mean function with respect to $\beta'$ evaluated at $\hat{\beta}$. $\hat{S}$ is
defined in (6.59). All variance matrix estimates assume errors that are independent across observations and
heteroskedastic, aside from the simplification for homoskedastic errors given for the NL2SLS estimator.


Nonlinear Instrumental Variables 


In the just-identified case one can directly use the sample moment conditions
corresponding to (6.55). This yields the method of moments estimator in the nonlinear
IV model $\hat{\beta}_{NLIV}$ that solves


$$\frac{1}{N}\sum_{i=1}^{N} z_i u_i = 0, \tag{6.60}$$


or equivalently $Z'u = 0$, with asymptotic variance matrix given in Table 6.3.

Nonlinear estimators are often computed using iterative methods that obtain an
optimum to an objective function rather than solve nonlinear systems of estimating
equations. For the just-identified case $\hat{\beta}_{NLIV}$ can be computed as a GMM estimator
minimizing (6.56) with any choice of weighting matrix, most simply $W_N = I$, leading to
the same estimate.


Optimal Nonlinear GMM 


For overidentified models the optimal GMM estimator uses weighting matrix
$W_N = \hat{S}^{-1}$. The optimal GMM estimator in the nonlinear IV model $\hat{\beta}_{OGMM}$ therefore
minimizes

$$Q_N(\beta) = \left[\frac{1}{N}u'Z\right]\hat{S}^{-1}\left[\frac{1}{N}Z'u\right]. \tag{6.61}$$


The estimated asymptotic variance matrix given in Table 6.3 is of relatively simple
form as (6.57) simplifies when $W_N = \hat{S}^{-1}$.

As in the linear case the optimal GMM estimator is a two-step estimator when errors
are heteroskedastic. In computing the estimated variance one can use $\hat{S}$ as presented
in Table 6.3, but it is more common to instead use an estimator $\tilde{S}$, say, that is also
computed using (6.59) but evaluates the residual at the optimal GMM estimator rather
than the first-step estimate used to form $\hat{S}$ in (6.61).


Nonlinear 2SLS 


A special case of the GMM estimator with instruments sets $W_N = (N^{-1}Z'Z)^{-1}$ in
(6.56). This gives the nonlinear two-stage least-squares estimator $\hat{\beta}_{NL2SLS}$ that
minimizes


$$Q_N(\beta) = \frac{1}{N}u'Z(Z'Z)^{-1}Z'u. \tag{6.62}$$


This estimator has the attraction of being the optimal GMM estimator if errors are
homoskedastic, as then $\hat{S} = s^2Z'Z/N$, where $s^2$ is a consistent estimate of the constant
$V[u|z]$, so $\hat{S}^{-1}$ is a multiple of $(Z'Z)^{-1}$.

With homoskedastic error this estimator has the simpler estimated asymptotic vari- 
ance given in Table 6.3, a result often given in textbooks. However, in microecono- 
metrics applications it is common to permit heteroskedastic errors and use the more 
complicated robust estimate also given in Table 6.3. 

The NL2SLS estimator, proposed by Amemiya (1974), was an important precursor
to GMM. The estimator can be motivated along similar lines to the first motivation
for linear 2SLS given in Section 6.4.3. Thus premultiply the model error $u$ by the
instruments $Z'$ to obtain $Z'u$, where $E[Z'u] = 0$ since $E[u|Z] = 0$. Then do nonlinear
GLS regression. Assuming homoskedastic errors this minimizes


$$Q_N(\beta) = u'Z[\sigma^2Z'Z]^{-1}Z'u,$$


as $V[u|Z] = \sigma^2I$ implies $V[Z'u|Z] = \sigma^2Z'Z$. This objective function is just a scalar
multiple of (6.62).

The Theil two-stage interpretation of linear 2SLS does not always carry over to non- 
linear models (see Section 6.5.4). Moreover, NL2SLS is clearly a one-step estimator. 
Amemiya chose the name NL2SLS because, as in the linear case, it permits consistent 
estimation using instrumental variables. The name should not be taken literally, and 
clearer terms are nonlinear IV or nonlinear generalized IV estimation. 


Instrument Choice in Nonlinear Models 


The preceding estimators presume the existence of instruments such that $E[u|z] = 0$
and that estimation is best if based on the unconditional moment condition $E[zu] = 0$.

Consider the nonlinear model with additive error so that $u = y - g(x, \beta)$. To be
relevant the instrument must be correlated with the regressors $x$; yet to be valid it
cannot be a direct causal variable for $y$. From the variance matrix given in (6.57) it is
actually correlation of $z$ with $\partial g/\partial\beta$ rather than just $x$ that matters, to ensure that
$\hat{D}'Z$ should be large. Weak instruments concerns are just as relevant here as in the
linear case studied in Section 4.9.

Given likely heteroskedasticity the optimal moment condition on which to base
estimation, given $E[u|z] = 0$, is not $E[zu] = 0$. From Section 6.3.7, however, the optimal
moment condition requires additional moment assumptions that are difficult to make,
so it is standard to use $E[zu] = 0$ as has been done here.

An alternative way to control for heteroskedasticity is to base GMM estimation on
an error term defined to be close to homoskedastic. For example, with count data rather
than use $u = y - \exp(x'\beta)$, work with the standardized error $u^* = u/\sqrt{\exp(x'\beta)}$
(see Section 6.2.2). Note, however, that $E[u^*|z] = 0$ and $E[u|z] = 0$ are different
assumptions.

Often just one component of x is correlated with u. Then, as in the linear case, the 
exogenous components can be used as instruments for themselves and the challenge is 
to find an additional instrument that is uncorrelated with u. There are some nonlinear 
applications that arise from formal economic models as in Section 6.2.7, in which case 
the many subcomponents of the information set are available as instruments. 


6.5.3. Poisson IV Example 


The Poisson regression model with exogenous regressors specifies $E[y|x] = \exp(x'\beta)$.
This can be viewed as a model with additive error $u = y - \exp(x'\beta)$. If regressors
are endogenous then $E[u|x] \neq 0$ and the Poisson MLE will then be inconsistent.
Consistent estimation assumes the existence of instruments $z$ that satisfy $E[u|z] = 0$ or,
equivalently,


$$E[y - \exp(x'\beta)|z] = 0.$$


The preceding results can be directly applied. The objective function is


$$Q_N(\beta) = \left[N^{-1}\sum_i z_i u_i\right]'W_N\left[N^{-1}\sum_i z_i u_i\right],$$


where $u_i = y_i - \exp(x_i'\beta)$. The first-order conditions are then

$$\left[\sum_i \exp(x_i'\beta)\,x_i z_i'\right]W_N\left[\sum_i z_i(y_i - \exp(x_i'\beta))\right] = 0.$$


The asymptotic distribution is given in Table 6.3, with $\hat{D}'Z = \sum_i \exp(x_i'\hat{\beta})\,x_i z_i'$ since
$\partial g/\partial\beta = \exp(x'\beta)x$ and $\hat{S}$ defined in (6.59) with $\hat{u}_i = y_i - \exp(x_i'\hat{\beta})$. The optimal
GMM and NL2SLS estimators differ in whether the weighting matrix is $\hat{S}^{-1}$ or
$(N^{-1}Z'Z)^{-1}$, where $Z'Z = \sum_i z_i z_i'$.

An alternative consistent estimator follows the Basmann approach. First, estimate
by OLS the reduced form $x_i = \Pi'z_i + v_i$, giving $K$ predictions $\hat{x}_i = \hat{\Pi}'z_i$. Second,
estimate by nonlinear IV as in (6.60) with instruments $\hat{x}_i$ rather than $z_i$. Given the OLS
formula for $\hat{\Pi}$ this estimator solves


$$\left[\sum_i x_i z_i'\right]\left[\sum_i z_i z_i'\right]^{-1}\left[\sum_i (y_i - \exp(x_i'\beta))\,z_i\right] = 0.$$


This estimator differs from the NL2SLS estimator because the first term in the
left-hand side differs. Potential problems with instead generalizing Theil's method for
linear models are detailed in the next section.

Similar issues arise in nonlinear models other than Poisson regression, such as
models for binary data.
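
For a concrete picture of the just-identified case, the following sketch solves the
sample moment condition (6.60) for an exponential mean with a single regressor and a
single instrument, using Newton's method. It is a minimal illustration under those
assumptions, not a production routine; the function name poisson_nliv, the starting
value, and the stopping rule are all hypothetical choices.

```python
import math


def poisson_nliv(y, x, z, beta0=0.0, tol=1e-12, max_iter=100):
    """Solve sum_i z_i * (y_i - exp(beta * x_i)) = 0 for scalar beta."""
    beta = beta0
    for _ in range(max_iter):
        # Sample moment and its derivative with respect to beta.
        m = sum(zi * (yi - math.exp(beta * xi)) for yi, xi, zi in zip(y, x, z))
        dm = -sum(zi * xi * math.exp(beta * xi) for xi, zi in zip(x, z))
        step = m / dm
        beta -= step
        if abs(step) < tol:
            break
    return beta
```

With more instruments than parameters the moment condition cannot be set exactly to
zero, and one would instead minimize the quadratic form given earlier in this section
with a chosen weighting matrix.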


6.5.4. Two-Stage Estimation in Nonlinear Models 


The usual interpretation of linear 2SLS can fail in nonlinear models. Thus suppose $y$
has mean $g(x'\beta)$ and there are instruments $z$ for the regressors $x$. Then OLS regression
of $x$ on instruments $z$ to get fitted values $\hat{x}$ followed by NLS regression of $y$ on $g(\hat{x}'\beta)$
can lead to inconsistent parameter estimates of $\beta$, as we now demonstrate. Instead, one
needs to use the NL2SLS estimator presented in the previous section.

Consider the following simple model, based on one presented in Amemiya (1984),
that is nonlinear in variables though still linear in parameters. Let


$$y = \beta x^2 + u, \tag{6.63}$$
$$x = \pi z + v,$$


where the zero-mean errors $u$ and $v$ are correlated. The regressor $x^2$ is endogenous,
since $x$ is a function of $v$ and by assumption $u$ and $v$ are correlated. As a result the
OLS estimator of $\beta$ is inconsistent. If $z$ is generated independently of the other random
variables in the model it is a valid instrument as it is clearly then independent of $u$ but
correlated with $x$.

The IV estimator is $\hat{\beta}_{IV} = (\sum_i z_i x_i^2)^{-1}\sum_i z_i y_i$. This can be implemented by a
regular IV regression of $y$ on $x^2$ with instrument $z$. Some algebra shows that, as expected,
$\hat{\beta}_{IV}$ equals the nonlinear IV estimator defined in (6.60).

Suppose instead we perform the following two-stage least-squares estimation.
First, regress $x$ on $z$ to get $\hat{x} = \hat{\pi}z$ and then regress $y$ on $\hat{x}^2$. Then
$\hat{\beta}_{2SLS} = (\sum_i \hat{x}_i^2\hat{x}_i^2)^{-1}\sum_i \hat{x}_i^2 y_i$, where $\hat{x}_i^2$ is the square of the prediction $\hat{x}_i$
obtained from OLS regression of $x$ on $z$. This yields an inconsistent estimate. Adapting
the proof for the linear case in Section 6.4.3 we have


$$y_i = \beta x_i^2 + u_i = \beta\hat{x}_i^2 + w_i,$$


where $w_i = \beta(x_i^2 - \hat{x}_i^2) + u_i$. An OLS regression of $y_i$ on $\hat{x}_i^2$ is inconsistent for $\beta$
because the regressor $\hat{x}_i^2$ is asymptotically correlated with the composite error term $w_i$.
Formally, $(x_i^2 - \hat{x}_i^2) = (\pi z_i + v_i)^2 - (\hat{\pi}z_i)^2 = \pi^2z_i^2 + 2\pi z_iv_i + v_i^2 - \hat{\pi}^2z_i^2$ implies,
using plim $\hat{\pi} = \pi$ and some algebra, that plim $N^{-1}\sum_i \hat{x}_i^2(x_i^2 - \hat{x}_i^2)$ = plim
$N^{-1}\sum_i \pi^2z_i^2v_i^2 \neq 0$ even if $z_i$ and $v_i$ are independent. Hence plim $N^{-1}\sum_i \hat{x}_i^2w_i$ = plim
$N^{-1}\sum_i \hat{x}_i^2\beta(x_i^2 - \hat{x}_i^2) \neq 0$.

A variation that is consistent, however, is to regress $x^2$ rather than $x$ on $z$ at the first
stage and use the prediction $\widehat{x^2} \neq (\hat{x})^2$ at the second stage. It can be shown that this
equals $\hat{\beta}_{IV}$. The instrument for $x^2$ needs to be the fitted value for $x^2$ rather than the
square of the fitted value for $x$.
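
The three estimators just discussed can be compared directly in a short sketch. The
code below assumes the scalar model (6.63) with no intercept in either regression;
the function names and the simulation design in simulate (instrument drawn uniformly
on [1, 2], unit-variance normal errors) are hypothetical choices made only for
illustration and are not the design used in the text's example.

```python
import random


def beta_iv(y, x, z):
    """IV estimator with z as instrument for x^2."""
    return sum(zi * yi for zi, yi in zip(z, y)) / sum(zi * xi * xi for zi, xi in zip(z, x))


def beta_naive_two_stage(y, x, z):
    """Regress x on z, then regress y on the square of the fitted value."""
    pi_hat = sum(zi * xi for zi, xi in zip(z, x)) / sum(zi * zi for zi in z)
    xhat_sq = [(pi_hat * zi) ** 2 for zi in z]
    return sum(a * yi for a, yi in zip(xhat_sq, y)) / sum(a * a for a in xhat_sq)


def beta_fitted_square(y, x, z):
    """Regress x^2 on z, then regress y on the fitted value of x^2."""
    c = sum(zi * xi * xi for zi, xi in zip(z, x)) / sum(zi * zi for zi in z)
    fit = [c * zi for zi in z]
    return sum(a * yi for a, yi in zip(fit, y)) / sum(a * a for a in fit)


def simulate(n, rho=0.8, seed=0):
    """Draw (y, x, z) from y = x^2 + u, x = z + v with corr(u, v) = rho."""
    rng = random.Random(seed)
    y, x, z = [], [], []
    for _ in range(n):
        zi = rng.uniform(1.0, 2.0)
        vi = rng.gauss(0.0, 1.0)
        ui = rho * vi + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        xi = zi + vi
        x.append(xi)
        z.append(zi)
        y.append(xi * xi + ui)
    return y, x, z
```

Calling the three functions on a draw from simulate shows that beta_fitted_square
reproduces beta_iv up to rounding error, whereas beta_naive_two_stage is a different
estimator.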

This example generalizes to other nonlinear models where the nonlinearity is in
regressors only, so that


$$y = g(x)'\beta + u,$$


where $g(x)$ is a nonlinear function of $x$. Common examples are use of powers and
natural logarithm. Suppose $E[u|z] = 0$. Inconsistent estimates are obtained by regressing
$x$ on $z$ to get predictions $\hat{x}$, and then regressing $y$ on $g(\hat{x})$. Consistent estimates can be
obtained by instead regressing $g(x)$ on $z$ to get predictions $\widehat{g(x)}$, and then regressing $y$
on $\widehat{g(x)}$ at the second stage. We use $\widehat{g(x)}$ rather than $g(\hat{x})$ as instrument for $g(x)$. Even
then the second-stage regression gives invalid standard errors as OLS output will use
residuals $\hat{u} = y - \widehat{g(x)}'\hat{\beta}$ rather than $\hat{u} = y - g(x)'\hat{\beta}$. It is best to directly use a GMM
or NL2SLS command.


Table 6.4. Nonlinear Two-Stage Least-Squares Example^a


                      Estimator
Variable      OLS        NL2SLS     Two-Stage
x             1.189      0.960      1.642
              (0.025)    (0.046)    (0.172)
R^2           0.88       0.85       0.80


^a The dgp given in the text has true coefficient equal to one. The sample
size is N = 200.

More generally models may be nonlinear in both variables and parameters. Consider
a single-index model with additive error, so that


$$y = g(x'\beta) + u.$$


Inconsistent estimates may be obtained by OLS of $x$ on $z$ to get predictions $\hat{x}$, and then
NLS regression of $y$ on $g(\hat{x}'\beta)$. Either GMM or NL2SLS needs to be used. Essentially,
for consistency we want $\widehat{g(x'\beta)}$, not $g(\hat{x}'\beta)$.


NL2SLS Example 


We consider NL2SLS estimation in a model with a simple nonlinearity resulting from 
the square of an endogenous variable appearing as a regressor, as in the previous 
section. 

The dgp is (6.63), so $y = \beta x^2 + u$ and $x = \pi z + v$, where $\beta = 1$, $\pi = 1$, and
$z = 1$ for all observations, and $(u, v)$ are joint normal with means 0, variances 1, and
correlation 0.8. A sample of size 200 is drawn. Results are shown in Table 6.4.

The nonlinearity here is quite mild with the square of $x$ rather than $x$ appearing as
regressor. Interest lies in estimating its coefficient $\beta$. The OLS estimator is
inconsistent, whereas NL2SLS is consistent. The two-stage method, where first an OLS
regression of $x$ on $z$ is used to form $\hat{x}$ and then an OLS regression of $y$ on $(\hat{x})^2$ is
performed, yields an estimate that is more than two standard errors from the true value of
$\beta = 1$. The simulation also indicates a loss in goodness of fit and precision with larger
standard errors and lower $R^2$, similar to linear IV.


6.6. Sequential Two-Step m-Estimation


Sequential two-step estimation procedures are estimation procedures where the
estimate of a parameter of ultimate interest is based on initial estimation of an
unknown parameter. An example is feasible GLS when the error has conditional variance
$\exp(z'\gamma)$. Given an estimate $\hat{\gamma}$ of $\gamma$, the FGLS estimator $\hat{\beta}$ minimizes
$\sum_{i=1}^{N}(y_i - x_i'\beta)^2/\exp(z_i'\hat{\gamma})$. A second example is the Heckman two-step estimator given in
Section 16.10.2.

These estimators are attractive as they can provide a relatively simple way to obtain 
consistent parameter estimates. However, for valid statistical inference it may be nec- 
essary to adjust the asymptotic variance of the second-step estimator to allow for the 
first-step estimation. We present results for the special case where the estimating equa- 
tions for both the first- and second-step estimators set a sample average to zero, which 
is the case for m-estimators, method of moments, and estimating equations estimators. 

Partition the parameter vector $\theta$ into $\theta_1$ and $\theta_2$, with ultimate interest in $\theta_2$. The
model is estimated sequentially by first obtaining $\hat{\theta}_1$ that solves
$N^{-1}\sum_{i=1}^{N}h_{1i}(\hat{\theta}_1) = 0$ and then, given $\hat{\theta}_1$, obtaining $\hat{\theta}_2$ that solves
$N^{-1}\sum_{i=1}^{N}h_{2i}(\hat{\theta}_1, \hat{\theta}_2) = 0$. In general the distribution of $\hat{\theta}_2$ given estimation of $\hat{\theta}_1$
differs from, and is more complicated than, the distribution of $\hat{\theta}_2$ if $\theta_1$ is known.
Statistical inference is invalid if it fails to take into account this complication,
except in some special cases given at the end of this section.

The following derivation is given in Newey (1984), with similar results obtained by
Murphy and Topel (1985) and Pagan (1986). The two-step estimator can be rewritten
as a one-step estimator where $(\theta_1, \theta_2)$ jointly solve the equations


$$N^{-1}\sum_{i=1}^{N}h_1(w_i, \theta_1) = 0, \tag{6.64}$$
$$N^{-1}\sum_{i=1}^{N}h_2(w_i, \theta_1, \theta_2) = 0.$$


Defining $\theta = (\theta_1', \theta_2')'$ and $h_i = (h_{1i}', h_{2i}')'$, we can write the equations as


$$N^{-1}\sum_{i=1}^{N}h(w_i, \theta) = 0.$$


In this setup it is assumed that $\dim(h_1) = \dim(\theta_1)$ and $\dim(h_2) = \dim(\theta_2)$, so that the
number of estimating equations equals the number of parameters. Then (6.64) is an
estimating equations estimator or MM estimator.

Consistency requires that plim $N^{-1}\sum_i h(w_i, \theta_0) = 0$, where $\theta_0 = [\theta_{10}', \theta_{20}']'$. This
condition should be satisfied if $\hat{\theta}_1$ is consistent for $\theta_{10}$ in the first step, and if
second-step estimation of $\theta_2$ with $\theta_{10}$ known (rather than estimated by $\hat{\theta}_1$) would lead to
a consistent estimate of $\theta_{20}$. Within a method of moments framework we require
$E[h_{1i}(\theta_1)] = 0$ and $E[h_{2i}(\theta_1, \theta_2)] = 0$. We assume that consistency is established.

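For the asymptotic distribution we apply the general result that


$$\sqrt{N}(\hat{\theta} - \theta_0) \stackrel{d}{\rightarrow} N[0, G_0^{-1}S_0(G_0^{-1})'],$$

where $G_0$ and $S_0$ are defined in Proposition 6.1. Partition $G_0$ and $S_0$ in a similar way
to the partitioning of $\theta$ and $h_i$. Then


$$G_0 = \lim \frac{1}{N}\sum_{i=1}^{N}E\begin{bmatrix} \partial h_{1i}/\partial\theta_1' & 0 \\ \partial h_{2i}/\partial\theta_1' & \partial h_{2i}/\partial\theta_2' \end{bmatrix} = \begin{bmatrix} G_{11} & 0 \\ G_{21} & G_{22} \end{bmatrix},$$

using $\partial h_{1i}(\theta)/\partial\theta_2' = 0$ since $h_{1i}(\theta)$ is not a function of $\theta_2$ from (6.64). Since $G_0$,
$G_{11}$, and $G_{22}$ are square matrices


$$G_0^{-1} = \begin{bmatrix} G_{11}^{-1} & 0 \\ -G_{22}^{-1}G_{21}G_{11}^{-1} & G_{22}^{-1} \end{bmatrix}.$$


Clearly,

$$S_0 = \lim \frac{1}{N}\sum_{i=1}^{N}E\begin{bmatrix} h_{1i}h_{1i}' & h_{1i}h_{2i}' \\ h_{2i}h_{1i}' & h_{2i}h_{2i}' \end{bmatrix} = \begin{bmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{bmatrix}.$$


The asymptotic variance of $\hat{\theta}_2$ is the (2, 2) submatrix of the variance matrix of $\hat{\theta}$. After
some algebra, we get


$$V[\hat{\theta}_2] = G_{22}^{-1}\left\{S_{22} + G_{21}[G_{11}^{-1}S_{11}G_{11}^{-1\prime}]G_{21}' - G_{21}G_{11}^{-1}S_{12} - S_{21}G_{11}^{-1\prime}G_{21}'\right\}G_{22}^{-1\prime}. \tag{6.65}$$

The usual computer output yields standard errors that are incorrect and understate
the true standard errors, since $V[\hat{\theta}_2]$ is then assumed to be $G_{22}^{-1}S_{22}G_{22}^{-1\prime}$, which can be
shown to be smaller than the true variance given in (6.65).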

There is no need to account for additional variability in the second step caused by
estimation in the first step in the special case that $E[\partial h_{2i}(\theta)/\partial\theta_1'] = 0$, as then
$G_{21} = 0$ and $V[\hat{\theta}_2]$ in (6.65) reduces to $G_{22}^{-1}S_{22}G_{22}^{-1\prime}$.
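
When both $\theta_1$ and $\theta_2$ are scalars, (6.65) involves only ordinary arithmetic, which
makes the role of each term easy to see. The sketch below is an illustration under
that scalar assumption; the function names two_step_variance and naive_variance and
the argument names are hypothetical, and the inputs are taken as given rather than
estimated.

```python
def two_step_variance(g11, g21, g22, s11, s12, s22):
    """Scalar version of (6.65); with scalars S21 equals S12."""
    first_step_term = g21 * (s11 / g11 ** 2) * g21   # G21 [G11^-1 S11 G11^-1'] G21'
    cross_terms = 2.0 * g21 * s12 / g11              # G21 G11^-1 S12 + S21 G11^-1' G21'
    return (s22 + first_step_term - cross_terms) / g22 ** 2


def naive_variance(g22, s22):
    """Variance that ignores first-step estimation: G22^-1 S22 G22^-1'."""
    return s22 / g22 ** 2
```

Setting g21 to zero in two_step_variance returns the same value as naive_variance,
which is the special case described in the preceding paragraph.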

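A well-known example of $G_{21} = 0$ is FGLS. Then for heteroskedastic errors


$$h_{2i}(\theta) = \frac{x_i(y_i - x_i'\theta_2)}{\sigma^2(x_i, \theta_1)},$$


where $V[y_i|x_i] = \sigma^2(x_i, \theta_1)$, and


$$E[\partial h_{2i}(\theta)/\partial\theta_1'] = E\left[-\frac{x_i(y_i - x_i'\theta_2)}{\sigma^2(x_i, \theta_1)^2}\frac{\partial\sigma^2(x_i, \theta_1)}{\partial\theta_1'}\right],$$


which equals zero since $E[y_i|x_i] = x_i'\theta_2$. Furthermore, for FGLS consistency of $\hat{\theta}_2$
does not require that $\hat{\theta}_1$ be consistent since $E[h_{2i}(\theta)] = 0$ just requires that
$E[y_i|x_i] = x_i'\theta_2$, which does not depend on $\theta_1$.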

A second example of $G_{21} = 0$ is ML estimation with a block diagonal information matrix,
so that $E[\partial^2L(\theta)/\partial\theta_1\partial\theta_2'] = 0$. This is the case for example for regression under
normality, where $\theta_1$ are the variance parameters and $\theta_2$ are the regression parameters.

In other examples, however, $G_{21} \neq 0$ and the more cumbersome expression (6.65)
needs to be used. This is done automatically by computer packages for some standard
two-step estimators, most notably Heckman's two-step estimator of the sample selection
model given in Section 16.5.4. Otherwise, $V[\hat{\theta}_2]$ needs to be computed manually.
Many of the components come from earlier estimation. In particular, $\hat{G}_{11}^{-1}\hat{S}_{11}\hat{G}_{11}^{-1\prime}$ is
the robust variance matrix estimate of $\hat{\theta}_1$ and $\hat{G}_{22}^{-1}\hat{S}_{22}\hat{G}_{22}^{-1\prime}$ is the robust variance matrix
estimate of $\hat{\theta}_2$ that incorrectly ignores the estimation error in $\hat{\theta}_1$. For data independent
over $i$ the subcomponents of the $S_0$ submatrix are consistently estimated by
$\hat{S}_{jk} = N^{-1}\sum_i \hat{h}_{ji}\hat{h}_{ki}'$, $j, k = 1, 2$. This leaves computation of
$\hat{G}_{21} = N^{-1}\sum_i \partial h_{2i}/\partial\theta_1'|_{\hat{\theta}}$ as the main challenge.

A recommended simpler approach is to obtain bootstrap standard errors (see
Section 16.2.5), or directly jointly estimate $\theta_1$ and $\theta_2$ in the combined model (6.64),
assuming access to a GMM routine.

These simpler approaches can also be applied to sequential estimators that are 
GMM estimators rather than m-estimators. Then combining the two estimators will 
lead to a set of conditions more complicated than (6.64) and we no longer get (6.65). 
However, one can still bootstrap or estimate jointly rather than sequentially. 


6.7. Minimum Distance Estimation 


Minimum distance estimation provides a way to estimate structural parameters $\theta$ that
are a specified function of reduced form parameters $\pi$, given a consistent estimate
$\hat{\pi}$ of $\pi$.

A standard reference is Ferguson (1958). Rothenberg (1973) applied this method
to linear simultaneous equations models, though the alternative methods given in
Section 6.9.6 are the standard methods used. Minimum distance estimation is most often
used in panel data analysis. In the initial work by Chamberlain (1982, 1984) (see
Section 22.2.7) he lets $\hat{\pi}$ be OLS estimates from linear regression of the current-period
dependent variable on regressors in all periods. Subsequent applications to covariance
structures (see Section 22.5.4) let $\hat{\pi}$ be estimated variances and autocovariances of the
panel data. See also the indirect inference method (Section 12.6).

Suppose that the relationship between $q$ structural parameters and $r > q$ reduced
form parameters is that $\pi_0 = g(\theta_0)$. Further suppose that we have a consistent estimate
$\hat{\pi}$ of the reduced form parameters. An obvious estimator is $\hat{\theta}$ such that $\hat{\pi} = g(\hat{\theta})$, but
this is infeasible since $q < r$. Instead, the minimum distance (MD) estimator $\hat{\theta}_{MD}$
minimizes with respect to $\theta$ the objective function


$$Q_N(\theta) = (\hat{\pi} - g(\theta))'W_N(\hat{\pi} - g(\theta)), \tag{6.66}$$


where $W_N$ is an $r \times r$ weighting matrix.
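
If $\hat{\pi} \stackrel{p}{\rightarrow} \pi_0$ and $W_N \stackrel{p}{\rightarrow} W_0$, where $W_0$ is finite positive semidefinite, then
$Q_N(\theta) \stackrel{p}{\rightarrow} Q_0(\theta) = (\pi_0 - g(\theta))'W_0(\pi_0 - g(\theta))$. It follows that $\theta_0$ is locally identified
if Rank$[W_0 \times \partial g(\theta)/\partial\theta'] = q$, while consistency essentially requires that $\pi_0 = g(\theta_0)$.
For the MD estimator $\sqrt{N}(\hat{\theta}_{MD} - \theta_0) \stackrel{d}{\rightarrow} N[0, V[\hat{\theta}_{MD}]]$, where


$$V[\hat{\theta}_{MD}] = (G_0'W_0G_0)^{-1}(G_0'W_0V[\hat{\pi}]W_0G_0)(G_0'W_0G_0)^{-1}, \tag{6.67}$$


$G_0 = \partial g(\theta)/\partial\theta'|_{\theta_0}$, and it is assumed that the reduced form parameters $\hat{\pi}$ have limit
distribution $\sqrt{N}(\hat{\pi} - \pi_0) \stackrel{d}{\rightarrow} N[0, V[\hat{\pi}]]$. More efficient reduced form estimators lead
to more efficient MD estimators, since smaller $V[\hat{\pi}]$ leads to smaller $V[\hat{\theta}_{MD}]$ in (6.67).

To obtain the result (6.67), begin with the following rescaling of the first-order
conditions for the MD estimator:


$$G_N(\hat{\theta})'W_N\sqrt{N}(\hat{\pi} - g(\hat{\theta})) = 0, \tag{6.68}$$


where $G_N(\theta) = \partial g(\theta)/\partial\theta'$. An exact first-order Taylor series expansion about $\theta_0$
yields


$$\sqrt{N}(\hat{\pi} - g(\hat{\theta})) = \sqrt{N}(\hat{\pi} - \pi_0) - G_N(\theta^*)\sqrt{N}(\hat{\theta} - \theta_0), \tag{6.69}$$


where $\theta^*$ lies between $\hat{\theta}$ and $\theta_0$ and we have used $g(\theta_0) = \pi_0$. Substituting (6.69)
back into (6.68) and solving for $\sqrt{N}(\hat{\theta} - \theta_0)$ yields


$$\sqrt{N}(\hat{\theta} - \theta_0) = [G_N(\hat{\theta})'W_NG_N(\theta^*)]^{-1}G_N(\hat{\theta})'W_N\sqrt{N}(\hat{\pi} - \pi_0), \tag{6.70}$$


which leads directly to (6.67).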

For given reduced form estimator $\hat{\pi}$, the most efficient MD estimator uses weighting
matrix $W_N = \hat{V}[\hat{\pi}]^{-1}$ in (6.66). This estimator is called the optimal MD (OMD)
estimator, and sometimes the minimum chi-square estimator following Ferguson
(1958).

A common alternative special case is the equally weighted minimum distance
(EWMD) estimator, which sets $W_N = I$. This is less efficient than the OMD estimator,
but it does not have the finite-sample bias problems analogous to those discussed
in Section 6.3.5 that arise when the optimal weighting matrix is used. The EWMD
estimator can be simply obtained by NLS regression of $\hat{\pi}_j$ on $g_j(\theta)$, $j = 1, \ldots, r$, since
minimizing $(\hat{\pi} - g(\theta))'(\hat{\pi} - g(\theta))$ yields the same first-order conditions as those in
(6.68) with $W_N = I$.

The minimized value of the objective function for the OMD is chi-squared
distributed. Specifically,


$$(\hat{\pi} - g(\hat{\theta}_{OMD}))'\hat{V}[\hat{\pi}]^{-1}(\hat{\pi} - g(\hat{\theta}_{OMD})) \tag{6.71}$$


is asymptotically distributed as $\chi^2(r - q)$ under $H_0: g(\theta_0) = \pi_0$. This provides a
model specification test analogous to the OIR test of Section 6.3.8.
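
The MD formulas become closed-form when $g(\theta)$ is linear in a scalar $\theta$ and the
weighting matrix is diagonal. The sketch below assumes exactly that: $g_j(\theta) = a_j\theta$ with
known constants $a_j$, and reduced form estimates with known variances $v_j$ and no
covariances. These restrictions, and the function names md_estimate, md_variance, and
omd_statistic, are assumptions made only to keep the illustration short.

```python
def md_estimate(pi_hat, a, w):
    """Minimize sum_j w_j * (pi_hat_j - a_j * theta)^2 over scalar theta."""
    num = sum(wj * aj * pj for wj, aj, pj in zip(w, a, pi_hat))
    den = sum(wj * aj * aj for wj, aj in zip(w, a))
    return num / den


def md_variance(a, w, v):
    """Scalar, diagonal version of the sandwich form (6.67)."""
    bread = sum(wj * aj * aj for wj, aj in zip(w, a))               # G'WG
    meat = sum(wj * wj * aj * aj * vj for wj, aj, vj in zip(w, a, v))  # G'W V W G
    return meat / bread ** 2


def omd_statistic(pi_hat, a, v):
    """OMD estimate (weights 1/v_j) and the quadratic form in (6.71)."""
    w = [1.0 / vj for vj in v]
    theta = md_estimate(pi_hat, a, w)
    stat = sum((pj - aj * theta) ** 2 / vj for pj, aj, vj in zip(pi_hat, a, v))
    return theta, stat
```

Passing unit weights to md_estimate gives the EWMD estimator, and comparing
md_variance under the two weighting schemes shows the efficiency gain from optimal
weighting in this simple setting.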

The MD estimator is qualitatively similar to the GMM estimator. The GMM
framework is the standard one employed. MD estimation is most often used in panel studies
of covariance structures, since then $\hat{\pi}$ comprises easily estimated sample moments
(variances and covariances) that can then be used to obtain $\hat{\theta}$.


6.8. Empirical Likelihood 


The MM and GMM approaches do not require complete specification of the
conditional density. Instead, estimation is based on moment conditions of the form
$E[h(y, x, \theta)] = 0$. The empirical likelihood approach, due to Owen (1988), is an
alternative estimation procedure based on the same moment condition.

An attraction of the empirical likelihood estimator is that, although it is
asymptotically equivalent to the GMM estimator, it has different finite-sample properties,
and in some examples it outperforms the GMM estimator.


6.8.1. Empirical Likelihood Estimation of Population Mean


We begin with empirical likelihood in the case of a scalar iid random variable $y$
with density $f(y)$ and sample likelihood function $\prod_i f(y_i)$. The complication
considered here is that the density $f(y)$ is not specified, so the usual ML approach is not
possible.

A completely nonparametric approach seeks to estimate the density $f(y)$ evaluated
at each of the sample values of $y$. Let $\pi_i = f(y_i)$ denote the probability that the $i$th
observation on $y$ takes the realized value $y_i$. Then the goal is to maximize the so-called
empirical likelihood function $\prod_i \pi_i$, or equivalently to maximize the empirical
log-likelihood function $N^{-1}\sum_i \ln\pi_i$, which is a multinomial model with no structure
placed on $\pi_i$. This log-likelihood is unbounded, unless a constraint is placed on the
range of values taken by $\pi_i$. The normalization used is that $\sum_i \pi_i = 1$. This yields the
standard estimate of the cumulative distribution function in the fully nonparametric
case, as we now demonstrate.

The empirical likelihood estimator maximizes with respect to $\boldsymbol{\pi}$ and $\eta$ the
Lagrangian

$$
\mathcal{L}_{EL}(\boldsymbol{\pi}, \eta) = \frac{1}{N}\sum_{i=1}^{N} \ln \pi_i - \eta\left(\sum_{i=1}^{N} \pi_i - 1\right), \tag{6.72}
$$

where $\boldsymbol{\pi} = [\pi_1 \ldots \pi_N]'$ and $\eta$ is a Lagrange multiplier. Although the data $y_i$ do not
explicitly appear in (6.72) they appear implicitly as $\pi_i = f(y_i)$. Setting the derivatives
with respect to $\pi_i$ ($i = 1, \ldots, N$) and $\eta$ to zero and solving yields $\hat{\pi}_i = 1/N$ and
$\hat{\eta} = 1$. Thus the estimated density function $\hat{f}(y)$ has mass $1/N$ at each of the realized values
$y_i$, $i = 1, \ldots, N$. The resulting distribution function is $\hat{F}(y) = N^{-1}\sum_{i=1}^{N} \mathbf{1}(y_i \le y)$,
where $\mathbf{1}(A) = 1$ if event $A$ occurs and 0 otherwise. $\hat{F}(y)$ is just the usual empirical
distribution function.
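
The result that equal weights solve the unconstrained problem can be checked numerically. The following is a minimal sketch, assuming NumPy and a small simulated sample; the function names, the seed, and the sample itself are illustrative assumptions rather than part of the original text. It compares the empirical log-likelihood at $\pi_i = 1/N$ with its value at many other probability vectors that also sum to one.

```python
import numpy as np

def empirical_log_likelihood(pi):
    """Empirical log-likelihood N^{-1} * sum_i ln(pi_i), the first term of (6.72)."""
    pi = np.asarray(pi, dtype=float)
    return np.mean(np.log(pi))

def edf(sample, y):
    """Empirical distribution function: share of observations with y_i <= y."""
    return np.mean(np.asarray(sample) <= y)

rng = np.random.default_rng(0)           # hypothetical seed
y = rng.normal(size=8)                   # hypothetical sample
N = y.size
uniform = np.full(N, 1.0 / N)            # the solution pi_i = 1/N

# Random probability vectors on the simplex (each row sums to one).
others = rng.dirichlet(np.ones(N), size=1000)
best_other = max(empirical_log_likelihood(p) for p in others)

print(empirical_log_likelihood(uniform), best_other)
print(edf(y, np.median(y)))
```

None of the alternative weight vectors attains the value $\ln(1/N)$ reached by equal weights, which is what the first-order conditions imply.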

Now introduce parameters. As a simple example, suppose we introduce the moment
restriction that $E[y - \mu] = 0$, where $\mu$ is the unknown population mean. In the empirical
likelihood context this population moment is replaced by a sample moment, where
the sample moment weights sample values by the probabilities $\pi_i$. Thus we introduce
the constraint that $\sum_i \pi_i (y_i - \mu) = 0$. The Lagrangian for the maximum empirical
likelihood estimator is

$$
\mathcal{L}_{EL}(\boldsymbol{\pi}, \eta, \lambda, \mu) = \frac{1}{N}\sum_{i=1}^{N} \ln \pi_i - \eta\left(\sum_{i=1}^{N} \pi_i - 1\right) - \lambda \sum_{i=1}^{N} \pi_i (y_i - \mu), \tag{6.73}
$$

where $\eta$ and $\lambda$ are Lagrange multipliers.

Begin by differentiating the Lagrangian with respect to $\pi_i$ ($i = 1, \ldots, N$), $\eta$, and
$\lambda$ but not $\mu$. Setting these derivatives to zero yields equations that are functions
of $\mu$. Solving leads to the solution $\pi_i = \pi_i(\mu)$ and hence an empirical likelihood
$N^{-1}\sum_i \ln \pi_i(\mu)$ that is then maximized with respect to $\mu$. This solution method leads
to nonlinear equations that need to be solved numerically.

For this particular problem an easier way to solve for $\mu$ is to note that the maximized
value of $\mathcal{L}_{EL}(\boldsymbol{\pi}, \eta, \lambda, \mu)$ must be less than or equal to $N^{-1}\sum_i \ln N^{-1}$, since
this is the maximized value without the last constraint. However, $\mathcal{L}_{EL}(\boldsymbol{\pi}, \eta, \lambda, \mu)$ equals
$N^{-1}\sum_i \ln N^{-1}$ if $\pi_i = 1/N$ and $\hat{\mu} = N^{-1}\sum_i y_i = \bar{y}$. So the maximum empirical
likelihood estimator of the population mean is the sample mean.
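
The numerical route described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses NumPy, a simulated sample, and bisection as the root finder, and all function names are hypothetical. For a given $\mu$ the first-order conditions of (6.73) give $\eta = 1$ and $\pi_i(\mu) = 1/[N(1 + \lambda(y_i - \mu))]$, where $\lambda$ solves $\sum_i (y_i - \mu)/(1 + \lambda(y_i - \mu)) = 0$.

```python
import numpy as np

def solve_lambda(d, tol=1e-12, max_iter=200):
    """Solve sum_i d_i / (1 + lam * d_i) = 0 for lam by bisection, with d_i = y_i - mu.

    The left-hand side is strictly decreasing in lam on the interval where every
    1 + lam * d_i is positive, so bisection on that interval finds the unique root.
    Requires min(d) < 0 < max(d), that is, mu strictly inside the sample range.
    """
    lo, hi = -1.0 / d.max(), -1.0 / d.min()
    width = hi - lo
    lo, hi = lo + 1e-10 * width, hi - 1e-10 * width
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if np.sum(d / (1.0 + mid * d)) > 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

def profile_log_el(y, mu):
    """Return N^{-1} sum_i ln pi_i(mu) and the implied probabilities pi_i(mu)."""
    d = y - mu
    lam = solve_lambda(d)
    pi = 1.0 / (y.size * (1.0 + lam * d))
    return np.mean(np.log(pi)), pi

rng = np.random.default_rng(1)                    # hypothetical seed
y = rng.exponential(scale=2.0, size=50)           # hypothetical skewed sample
half = 0.5 * y.std()
grid = np.linspace(y.mean() - half, y.mean() + half, 201)
values = np.array([profile_log_el(y, m)[0] for m in grid])
mu_hat = grid[values.argmax()]
print(mu_hat, y.mean())
```

On the grid the profile empirical log-likelihood peaks at the sample mean and never exceeds $\ln(1/N)$, in line with the argument just given.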


6.8.2. Empirical Likelihood Estimation of Regression Parameters 


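Now consider regression data that are iid over $i$. The only structure placed on the
model is the set of $r$ moment conditions

$$
E[\mathbf{h}(\mathbf{w}_i, \boldsymbol{\theta})] = \mathbf{0}, \tag{6.74}
$$

where $\mathbf{h}(\cdot)$ and $\mathbf{w}_i$ are defined in Section 6.3.1. For example, $\mathbf{h}(\mathbf{w}, \boldsymbol{\theta}) = \mathbf{x}(y - \mathbf{x}'\boldsymbol{\theta})$ for
OLS estimation and $\mathbf{h}(y, \mathbf{x}, \boldsymbol{\theta}) = (\partial g/\partial \boldsymbol{\theta})(y - g(\mathbf{x}, \boldsymbol{\theta}))$ for NLS estimation.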

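The empirical likelihood approach maximizes the empirical likelihood function
$N^{-1}\sum_i \ln \pi_i$ subject to the constraint $\sum_i \pi_i = 1$ (see (6.72)) and the additional
sample constraint based on the population moment condition (6.74) that

$$
\sum_{i=1}^{N} \pi_i \mathbf{h}(\mathbf{w}_i, \boldsymbol{\theta}) = \mathbf{0}. \tag{6.75}
$$

Thus we maximize with respect to $\boldsymbol{\pi}$, $\eta$, $\boldsymbol{\lambda}$, and $\boldsymbol{\theta}$

$$
\mathcal{L}_{EL}(\boldsymbol{\pi}, \eta, \boldsymbol{\lambda}, \boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N} \ln \pi_i - \eta\left(\sum_{i=1}^{N} \pi_i - 1\right) - \boldsymbol{\lambda}'\sum_{i=1}^{N} \pi_i \mathbf{h}(\mathbf{w}_i, \boldsymbol{\theta}), \tag{6.76}
$$

where the Lagrangian multipliers are a scalar $\eta$ and column vector $\boldsymbol{\lambda}$ of the same
dimension as $\mathbf{h}(\cdot)$.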
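First, concentrate out the $N$ parameters $\pi_1, \ldots, \pi_N$. Differentiating $\mathcal{L}_{EL}(\boldsymbol{\pi}, \eta, \boldsymbol{\lambda}, \boldsymbol{\theta})$
with respect to $\pi_i$ yields $1/(N\pi_i) - \eta - \boldsymbol{\lambda}'\mathbf{h}_i = 0$. Then we obtain $\eta = 1$ by multiplying
by $\pi_i$ and summing over $i$ and using $\sum_i \pi_i \mathbf{h}_i = \mathbf{0}$. It follows that

$$
\pi_i(\boldsymbol{\theta}, \boldsymbol{\lambda}) = \frac{1}{N(1 + \boldsymbol{\lambda}'\mathbf{h}(\mathbf{w}_i, \boldsymbol{\theta}))}. \tag{6.77}
$$

The problem is now reduced to a maximization problem with respect to $(r + q)$ variables
$\boldsymbol{\lambda}$ and $\boldsymbol{\theta}$, the Lagrangian multipliers associated with the $r$ moment conditions
(6.74), and the $q$ parameters $\boldsymbol{\theta}$.

Solution at this stage requires numerical methods, even for just-identified models.
One can maximize with respect to $\boldsymbol{\theta}$ and $\boldsymbol{\lambda}$ the function
$N^{-1}\sum_i \ln[1/(N(1 + \boldsymbol{\lambda}'\mathbf{h}(\mathbf{w}_i, \boldsymbol{\theta})))]$.

Alternatively, first concentrate out $\boldsymbol{\lambda}$. Differentiating $\mathcal{L}_{EL}(\boldsymbol{\pi}(\boldsymbol{\theta}, \boldsymbol{\lambda}), \eta, \boldsymbol{\lambda})$ with respect
to $\boldsymbol{\lambda}$ yields $\sum_i \pi_i \mathbf{h}_i = \mathbf{0}$. Define $\boldsymbol{\lambda}(\boldsymbol{\theta})$ to be the implicit solution to the system of
$\dim(\boldsymbol{\lambda})$ equations

$$
\sum_{i=1}^{N} \frac{1}{N(1 + \boldsymbol{\lambda}'\mathbf{h}(\mathbf{w}_i, \boldsymbol{\theta}))}\,\mathbf{h}(\mathbf{w}_i, \boldsymbol{\theta}) = \mathbf{0}.
$$

In implementation numerical methods are needed to obtain $\boldsymbol{\lambda}(\boldsymbol{\theta})$. Then (6.77) becomes

$$
\pi_i(\boldsymbol{\theta}) = \frac{1}{N(1 + \boldsymbol{\lambda}(\boldsymbol{\theta})'\mathbf{h}(\mathbf{w}_i, \boldsymbol{\theta}))}. \tag{6.78}
$$

By substituting (6.78) into the empirical likelihood function $N^{-1}\sum_i \ln \pi_i$, the empirical
log-likelihood function evaluated at $\boldsymbol{\theta}$ becomes

$$
\mathcal{L}_{EL}(\boldsymbol{\theta}) = -N^{-1}\sum_{i=1}^{N} \ln\left[N(1 + \boldsymbol{\lambda}(\boldsymbol{\theta})'\mathbf{h}(\mathbf{w}_i, \boldsymbol{\theta}))\right].
$$

The maximum empirical likelihood (MEL) estimator $\hat{\boldsymbol{\theta}}_{MEL}$ maximizes this function
with respect to $\boldsymbol{\theta}$.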
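
The inner step of obtaining $\boldsymbol{\lambda}(\boldsymbol{\theta})$ can be sketched numerically. The code below is an illustration only: it assumes NumPy, simulated regression data, the OLS moment function $\mathbf{h}_i = \mathbf{x}_i(y_i - \mathbf{x}_i'\boldsymbol{\theta})$, and a damped Newton iteration chosen here for convenience (the text does not prescribe a particular algorithm). All function and variable names are hypothetical. The iteration solves $\sum_i \mathbf{h}_i/(1 + \boldsymbol{\lambda}'\mathbf{h}_i) = \mathbf{0}$, which is the first-order condition for minimizing the convex function $-\sum_i \ln(1 + \boldsymbol{\lambda}'\mathbf{h}_i)$ over $\boldsymbol{\lambda}$.

```python
import numpy as np

def lambda_of_theta(h, max_iter=100, tol=1e-10):
    """Solve sum_i h_i / (1 + lam'h_i) = 0 for lam; h is N x r with rows h(w_i, theta).

    Damped Newton steps are used: the step is shortened to 1 / (1 + decrement)
    while the Newton decrement is large, which keeps every 1 + lam'h_i positive.
    """
    lam = np.zeros(h.shape[1])
    for _ in range(max_iter):
        w = 1.0 + h @ lam
        grad = -(h / w[:, None]).sum(axis=0)
        hess = (h / w[:, None] ** 2).T @ h
        step = -np.linalg.solve(hess, grad)
        decrement = np.sqrt(max(-grad @ step, 0.0))
        lam = lam + (1.0 if decrement < 0.25 else 1.0 / (1.0 + decrement)) * step
        if decrement < tol:
            break
    return lam

rng = np.random.default_rng(2)                       # hypothetical seed
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=N)    # hypothetical data

def moments(theta):
    """OLS moment functions h_i = x_i (y_i - x_i'theta), one row per observation."""
    return X * (y - X @ theta)[:, None]

def log_el(theta):
    """Profile empirical log-likelihood -N^{-1} sum_i ln[N(1 + lam(theta)'h_i)]."""
    h = moments(theta)
    lam = lambda_of_theta(h)
    return -np.mean(np.log(N * (1.0 + h @ lam))), lam

theta_ols = np.linalg.solve(X.T @ X, X.T @ y)
theta_other = theta_ols + np.array([0.05, -0.05])
value_ols, lam_ols = log_el(theta_ols)
value_other, lam_other = log_el(theta_other)
print(value_ols, value_other, lam_ols, lam_other)
```

In this just-identified example the sample moments can be set exactly to zero, so at the OLS estimate the multiplier is zero, the weights equal $1/N$, and the profile function attains its upper bound $-\ln N$; any other $\boldsymbol{\theta}$ gives a lower value.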
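Qin and Lawless (1994) show that

$$
\sqrt{N}(\hat{\boldsymbol{\theta}}_{MEL} - \boldsymbol{\theta}_0) \xrightarrow{d} \mathcal{N}\left[\mathbf{0},\ \mathbf{A}(\boldsymbol{\theta}_0)^{-1}\mathbf{B}(\boldsymbol{\theta}_0)\mathbf{A}(\boldsymbol{\theta}_0)'^{-1}\right],
$$

where $\mathbf{A}(\boldsymbol{\theta}_0) = \operatorname{plim} E[\partial \mathbf{h}(\boldsymbol{\theta})/\partial \boldsymbol{\theta}'|_{\boldsymbol{\theta}_0}]$ and $\mathbf{B}(\boldsymbol{\theta}_0) = \operatorname{plim} E[\mathbf{h}(\boldsymbol{\theta})\mathbf{h}(\boldsymbol{\theta})'|_{\boldsymbol{\theta}_0}]$. This is the
same limit distribution as the method of moments (see (6.13)). In finite samples $\hat{\boldsymbol{\theta}}_{MEL}$
differs from $\hat{\boldsymbol{\theta}}_{GMM}$, however, and inference is based on sample estimates

$$
\hat{\mathbf{A}} = \sum_{i=1}^{N} \hat{\pi}_i \left.\frac{\partial \mathbf{h}_i}{\partial \boldsymbol{\theta}'}\right|_{\hat{\boldsymbol{\theta}}}, \qquad
\hat{\mathbf{B}} = \sum_{i=1}^{N} \hat{\pi}_i \mathbf{h}_i(\hat{\boldsymbol{\theta}})\mathbf{h}_i(\hat{\boldsymbol{\theta}})'
$$

that weight by the estimated probabilities $\hat{\pi}_i$ rather than the proportions $1/N$.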

Imbens (2002) provides a recent survey of empirical likelihood that contrasts empirical
likelihood with GMM. Variations include replacing $N^{-1}\sum_i \ln \pi_i$ in (6.76)
by $N^{-1}\sum_i \pi_i \ln \pi_i$. Empirical likelihood is computationally more burdensome; see
Imbens (2002) for a discussion. The advantage is that the asymptotic theory provides
a better finite-sample approximation to the distribution of the empirical likelihood
estimator than it does to that for the GMM estimator. This is pursued further in
Section 11.6.4.


6.9. Linear Systems of Equations 


The preceding estimation theory covers single-equation estimation methods used in 
the majority of applied studies. We now consider joint estimation of several equations. 
Equations linear in parameters with an additive error are presented in this section, with 
extensions to nonlinear systems given in the subsequent section. 

The main advantage of joint estimation is the gain in efficiency that results from 
incorporation of correlation in unobservables across equations for a given individual. 
Additionally, joint estimation may be necessary if there are restrictions on parameters 
across equations. With exogenous regressors systems estimation is a minor extension 
of single-equation OLS and GLS estimation, whereas with endogenous regressors it is 
single-equation IV methods that are adapted. 

One leading example is systems of equations such as those for observed demand of 
several commodities at a point in time for many individuals. For seemingly unrelated 
regression all regressors are exogenous whereas for simultaneous equations models 
some regressors are endogenous.

A second leading example is panel data, where a single equation is observed at
several points in time for many individuals, and each time period is treated as a separate 
equation. By viewing a panel data model as an example of a system it is possible to 
improve efficiency, obtain panel robust standard errors, and derive instruments when 
some regressors are endogenous. 

Many econometrics texts provide lengthy presentations of linear systems. The treat- 
ment here is very brief. It is mainly directed toward generalization to nonlinear systems 
(see Section 6.10) and application to panel data (see Chapters 21-23). 


6.9.1. Linear Systems of Equations 


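The single-equation linear model is given by $y_i = \mathbf{x}_i'\boldsymbol{\beta} + u_i$, where $y_i$ and $u_i$ are scalars
and $\mathbf{x}_i$ and $\boldsymbol{\beta}$ are column vectors. The multiple-equation linear model, or multivariate
linear model, with $G$ dependent variables is given by

$$
\mathbf{y}_i = \mathbf{X}_i\boldsymbol{\beta} + \mathbf{u}_i, \quad i = 1, \ldots, N, \tag{6.79}
$$

where $\mathbf{y}_i$ and $\mathbf{u}_i$ are $G \times 1$ vectors, $\mathbf{X}_i$ is a $G \times K$ matrix, and $\boldsymbol{\beta}$ is a $K \times 1$ column
vector.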

Throughout this section we make the cross-section assumption that the error vector
$\mathbf{u}_i$ is independent over $i$, so $E[\mathbf{u}_i\mathbf{u}_j'] = \mathbf{0}$ for $i \neq j$. However, components of $\mathbf{u}_i$ for
given $i$ may be correlated and have variances and covariances that vary over $i$, leading
to conditional error variance matrix for the $i$th individual

$$
\boldsymbol{\Omega}_i = E[\mathbf{u}_i\mathbf{u}_i'|\mathbf{X}_i]. \tag{6.80}
$$


There are various ways that a multiple-equation model may arise. At one extreme 
the seemingly unrelated equations model combines G equations, such as demands for 
different consumer goods, where parameters vary across equations and regressors may 
or may not vary across equations. At the other extreme the linear panel data model combines
G periods of data for the same equation, with parameters that are constant across 
periods and regressors that may or may not vary across periods. These two cases are 
presented in detail in Sections 6.9.3 and 6.9.4. 

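Stacking (6.79) over $N$ individuals gives

$$
\begin{bmatrix} \mathbf{y}_1 \\ \vdots \\ \mathbf{y}_N \end{bmatrix}
= \begin{bmatrix} \mathbf{X}_1 \\ \vdots \\ \mathbf{X}_N \end{bmatrix}\boldsymbol{\beta}
+ \begin{bmatrix} \mathbf{u}_1 \\ \vdots \\ \mathbf{u}_N \end{bmatrix}, \tag{6.81}
$$

or

$$
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}, \tag{6.82}
$$

where $\mathbf{y}$ and $\mathbf{u}$ are $NG \times 1$ vectors and $\mathbf{X}$ is an $NG \times K$ matrix.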

The results given in the following can be obtained by treating the stacked model
(6.82) in the same way as in the single-equation case. Thus the OLS estimator is
$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ and in the just-identified case with instrument matrix $\mathbf{Z}$ the IV estimator
is $\hat{\boldsymbol{\beta}} = (\mathbf{Z}'\mathbf{X})^{-1}\mathbf{Z}'\mathbf{y}$. The only real change is that the usual cross-section assumption of
a diagonal error variance matrix is replaced by assumption of a block-diagonal error
matrix. This block-diagonality needs to be accommodated in computing the estimated
variance matrix of a systems estimator and in forming feasible GLS estimators and
efficient GMM estimators.


6.9.2. Systems OLS and FGLS Estimation 


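OLS estimation of the system (6.82) yields the systems OLS estimator
$(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$. Using (6.81) it follows immediately that

$$
\hat{\boldsymbol{\beta}}_{SOLS} = \left[\sum_{i=1}^{N} \mathbf{X}_i'\mathbf{X}_i\right]^{-1} \sum_{i=1}^{N} \mathbf{X}_i'\mathbf{y}_i. \tag{6.83}
$$

The estimator is asymptotically normal and, assuming the data are independent over $i$,
the usual robust sandwich result applies and

$$
\hat{V}[\hat{\boldsymbol{\beta}}_{SOLS}] = \left[\sum_{i=1}^{N} \mathbf{X}_i'\mathbf{X}_i\right]^{-1} \sum_{i=1}^{N} \mathbf{X}_i'\hat{\mathbf{u}}_i\hat{\mathbf{u}}_i'\mathbf{X}_i \left[\sum_{i=1}^{N} \mathbf{X}_i'\mathbf{X}_i\right]^{-1}, \tag{6.84}
$$

where $\hat{\mathbf{u}}_i = \mathbf{y}_i - \mathbf{X}_i\hat{\boldsymbol{\beta}}$. This variance matrix estimate permits conditional variances and
covariances of the errors to differ across individuals.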
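
A compact way to compute (6.83) and (6.84) is to hold the data as arrays indexed by individual. The sketch below is illustrative only: it assumes NumPy, simulated data with errors that are correlated within an individual and heteroskedastic across individuals, and hypothetical names throughout.

```python
import numpy as np

def systems_ols(X, y):
    """Systems OLS (6.83) with the robust sandwich variance (6.84).

    X has shape (N, G, K) and y has shape (N, G); slice i holds X_i and y_i.
    """
    XtX = np.einsum("igk,igl->kl", X, X)          # sum_i X_i'X_i
    Xty = np.einsum("igk,ig->k", X, y)            # sum_i X_i'y_i
    beta = np.linalg.solve(XtX, Xty)
    u = y - X @ beta                              # residual vectors, shape (N, G)
    score = np.einsum("igk,ig->ik", X, u)         # row i is (X_i'u_i)'
    meat = score.T @ score                        # sum_i X_i'u_i u_i'X_i
    bread = np.linalg.inv(XtX)
    return beta, bread @ meat @ bread

rng = np.random.default_rng(3)                    # hypothetical seed
N, G, K = 500, 3, 2
X = rng.normal(size=(N, G, K))
common = rng.normal(size=(N, 1))                  # shared shock: correlation within i
u = common + rng.normal(size=(N, G)) * (1.0 + np.abs(X[:, :, 0]))
beta_true = np.array([1.0, -0.5])                 # hypothetical parameter values
y = X @ beta_true + u

beta_hat, V_hat = systems_ols(X, y)
print(beta_hat, np.sqrt(np.diag(V_hat)))
```

The point estimate is identical to OLS on the $NG$ stacked rows; only the variance calculation uses the grouping by individual.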

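Given correlation of the components of the error vector for a given individual,
more efficient estimation is possible by GLS or FGLS. If observations are independent
over $i$, the systems GLS estimator is systems OLS applied to the transformed
system

$$
\boldsymbol{\Omega}_i^{-1/2}\mathbf{y}_i = \boldsymbol{\Omega}_i^{-1/2}\mathbf{X}_i\boldsymbol{\beta} + \boldsymbol{\Omega}_i^{-1/2}\mathbf{u}_i, \tag{6.85}
$$

where $\boldsymbol{\Omega}_i$ is the error variance matrix defined in (6.80). The transformed error $\boldsymbol{\Omega}_i^{-1/2}\mathbf{u}_i$
has mean zero and variance

$$
\begin{aligned}
E\left[(\boldsymbol{\Omega}_i^{-1/2}\mathbf{u}_i)(\boldsymbol{\Omega}_i^{-1/2}\mathbf{u}_i)'|\mathbf{X}_i\right]
&= \boldsymbol{\Omega}_i^{-1/2} E[\mathbf{u}_i\mathbf{u}_i'|\mathbf{X}_i]\, \boldsymbol{\Omega}_i^{-1/2} \\
&= \boldsymbol{\Omega}_i^{-1/2}\boldsymbol{\Omega}_i\boldsymbol{\Omega}_i^{-1/2} \\
&= \mathbf{I}_G.
\end{aligned}
$$

So the transformed system has errors that are homoskedastic and uncorrelated over $G$
equations and OLS is efficient.

To implement this estimator, a model for $\boldsymbol{\Omega}_i$ needs to be specified, say $\boldsymbol{\Omega}_i = \boldsymbol{\Omega}_i(\boldsymbol{\gamma})$.
Then perform systems OLS estimation in the transformed system where $\boldsymbol{\Omega}_i$ is replaced
by $\boldsymbol{\Omega}_i(\hat{\boldsymbol{\gamma}})$, where $\hat{\boldsymbol{\gamma}}$ is a consistent estimate of $\boldsymbol{\gamma}$. This yields the systems feasible GLS
(SFGLS) estimator

$$
\hat{\boldsymbol{\beta}}_{SFGLS} = \left[\sum_{i=1}^{N} \mathbf{X}_i'\hat{\boldsymbol{\Omega}}_i^{-1}\mathbf{X}_i\right]^{-1} \sum_{i=1}^{N} \mathbf{X}_i'\hat{\boldsymbol{\Omega}}_i^{-1}\mathbf{y}_i. \tag{6.86}
$$

This estimator is asymptotically normal and to guard against possible misspecification
of $\boldsymbol{\Omega}_i(\boldsymbol{\gamma})$ we can use the robust sandwich estimate of the variance matrix

$$
\hat{V}[\hat{\boldsymbol{\beta}}_{SFGLS}] = \left[\sum_{i=1}^{N} \mathbf{X}_i'\hat{\boldsymbol{\Omega}}_i^{-1}\mathbf{X}_i\right]^{-1} \sum_{i=1}^{N} \mathbf{X}_i'\hat{\boldsymbol{\Omega}}_i^{-1}\hat{\mathbf{u}}_i\hat{\mathbf{u}}_i'\hat{\boldsymbol{\Omega}}_i^{-1}\mathbf{X}_i \left[\sum_{i=1}^{N} \mathbf{X}_i'\hat{\boldsymbol{\Omega}}_i^{-1}\mathbf{X}_i\right]^{-1}, \tag{6.87}
$$

where $\hat{\boldsymbol{\Omega}}_i = \boldsymbol{\Omega}_i(\hat{\boldsymbol{\gamma}})$.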

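The most common specification used for $\boldsymbol{\Omega}_i$ is to assume that it does not vary over
$i$. Then $\boldsymbol{\Omega}_i = \boldsymbol{\Omega}$ is a $G \times G$ matrix that can be consistently estimated for finite $G$ and
$N \to \infty$ by

$$
\hat{\boldsymbol{\Omega}} = \frac{1}{N}\sum_{i=1}^{N} \hat{\mathbf{u}}_i\hat{\mathbf{u}}_i', \tag{6.88}
$$

where $\hat{\mathbf{u}}_i = \mathbf{y}_i - \mathbf{X}_i\hat{\boldsymbol{\beta}}_{SOLS}$. Then the SFGLS estimator is (6.86) with $\hat{\boldsymbol{\Omega}}$ instead of $\hat{\boldsymbol{\Omega}}_i$,
and after some algebra the SFGLS estimator can also be written as

$$
\hat{\boldsymbol{\beta}}_{SFGLS} = \left[\mathbf{X}'(\hat{\boldsymbol{\Omega}}^{-1} \otimes \mathbf{I}_N)\mathbf{X}\right]^{-1} \mathbf{X}'(\hat{\boldsymbol{\Omega}}^{-1} \otimes \mathbf{I}_N)\mathbf{y}, \tag{6.89}
$$

where $\otimes$ denotes the Kronecker product. The ordering $\hat{\boldsymbol{\Omega}}^{-1} \otimes \mathbf{I}_N$ applies when the rows
of $\mathbf{X}$ and $\mathbf{y}$ are ordered equation by equation; with the individual-by-individual stacking
of (6.81) the same estimator has weight matrix $\mathbf{I}_N \otimes \hat{\boldsymbol{\Omega}}^{-1}$. The assumption that $\boldsymbol{\Omega}_i = \boldsymbol{\Omega}$ rules out, for
example, heteroskedasticity over $i$. This is a strong assumption, and in many applications
it is best to use robust standard errors calculated using (6.87), which gives correct
standard errors even if $\boldsymbol{\Omega}_i$ does vary over $i$.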
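
The two-step calculation can be sketched as follows, assuming NumPy and simulated data; all names are hypothetical. The sketch computes (6.86) with a common $\hat{\boldsymbol{\Omega}}$ from (6.88) and then reproduces the same estimate from a Kronecker-product form. Which Kronecker ordering is needed depends on how the rows are stacked: $\mathbf{I}_N \otimes \hat{\boldsymbol{\Omega}}^{-1}$ when rows are ordered individual by individual, and $\hat{\boldsymbol{\Omega}}^{-1} \otimes \mathbf{I}_N$ when rows are ordered equation by equation.

```python
import numpy as np

def sfgls_common_omega(X, y):
    """Two-step systems FGLS with a common G x G error variance matrix.

    X has shape (N, G, K) and y has shape (N, G).
    Step 1: systems OLS residuals give Omega_hat as in (6.88).
    Step 2: apply (6.86) with Omega_hat in place of each Omega_i.
    """
    N = X.shape[0]
    b_ols = np.linalg.solve(np.einsum("igk,igl->kl", X, X),
                            np.einsum("igk,ig->k", X, y))
    u = y - X @ b_ols
    omega = u.T @ u / N
    w = np.linalg.inv(omega)
    A = np.einsum("igk,gh,ihl->kl", X, w, X)      # sum_i X_i' W X_i
    c = np.einsum("igk,gh,ih->k", X, w, y)        # sum_i X_i' W y_i
    return np.linalg.solve(A, c), omega

rng = np.random.default_rng(4)                    # hypothetical seed
N, G, K = 60, 2, 2
X = rng.normal(size=(N, G, K))
u = rng.multivariate_normal(np.zeros(G), [[1.0, 0.8], [0.8, 2.0]], size=N)
y = X @ np.array([1.0, -1.0]) + u                 # hypothetical parameter values
b_fgls, omega_hat = sfgls_common_omega(X, y)

w = np.linalg.inv(omega_hat)
# Rows ordered individual by individual: weight I_N (x) Omega^{-1}.
Xs, ys = X.reshape(N * G, K), y.reshape(N * G)
Wi = np.kron(np.eye(N), w)
b_by_individual = np.linalg.solve(Xs.T @ Wi @ Xs, Xs.T @ Wi @ ys)
# Rows ordered equation by equation: weight Omega^{-1} (x) I_N.
Xe, ye = X.transpose(1, 0, 2).reshape(G * N, K), y.T.reshape(G * N)
We = np.kron(w, np.eye(N))
b_by_equation = np.linalg.solve(Xe.T @ We @ Xe, Xe.T @ We @ ye)
print(b_fgls, b_by_individual, b_by_equation)
```

The summation form avoids building an $NG \times NG$ weight matrix, which matters when $N$ is large.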


6.9.3. Seemingly Unrelated Regressions 


The seemingly unrelated regressions (SUR) model specifies the $g$th of $G$ equations
for the $i$th of $N$ individuals to be given by

$$
y_{ig} = \mathbf{x}_{ig}'\boldsymbol{\beta}_g + u_{ig}, \quad g = 1, \ldots, G, \ i = 1, \ldots, N, \tag{6.90}
$$

where $\mathbf{x}_{ig}$ are regressors that are assumed to be exogenous and $\boldsymbol{\beta}_g$ are $K_g \times 1$ parameter
vectors. For example, for demand data on $G$ goods for $N$ individuals, $y_{ig}$ may
be the $i$th individual's expenditure on good $g$ or budget share for good $g$. In all that
follows $G$ is assumed fixed and reasonably small while $N \to \infty$. Note that we use the
subscript order $y_{ig}$ as results then transfer easily to panel data with variable $y_{it}$ (see
Section 6.9.4). Other authors use the reverse order $y_{gi}$.

The SUR model was proposed by Zellner (1962). The term seemingly unrelated
regressions is deceptive, as clearly the equations are related if the errors $u_{ig}$ in different
equations are correlated. For the SUR model the relationship between $y_{ig}$ and $y_{ih}$ is
indirect; it comes through correlation in the errors across different equations.

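Estimation combines observations over both equations and individuals. For
microeconometrics applications, where independence over $i$ is assumed, it is most convenient
to first stack all equations for a given individual. Stacking all $G$ equations for the $i$th
individual we get

$$
\begin{bmatrix} y_{i1} \\ \vdots \\ y_{iG} \end{bmatrix}
= \begin{bmatrix} \mathbf{x}_{i1}' & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \ddots & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{x}_{iG}' \end{bmatrix}
\begin{bmatrix} \boldsymbol{\beta}_1 \\ \vdots \\ \boldsymbol{\beta}_G \end{bmatrix}
+ \begin{bmatrix} u_{i1} \\ \vdots \\ u_{iG} \end{bmatrix}, \tag{6.91}
$$

which is of the form $\mathbf{y}_i = \mathbf{X}_i\boldsymbol{\beta} + \mathbf{u}_i$ in (6.79), where $\mathbf{y}_i$ and $\mathbf{u}_i$ are $G \times 1$ vectors
with $g$th entries $y_{ig}$ and $u_{ig}$, $\mathbf{X}_i$ is a $G \times K$ matrix with $g$th row $[\mathbf{0} \cdots \mathbf{x}_{ig}' \cdots \mathbf{0}]$, and
$\boldsymbol{\beta} = [\boldsymbol{\beta}_1' \ldots \boldsymbol{\beta}_G']'$ is a $K \times 1$ vector where $K = K_1 + \cdots + K_G$. Some authors instead
first stack all individuals for a given equation, leading to different algebraic expressions
for the same estimators.

Given the definitions of $\mathbf{X}_i$ and $\mathbf{y}_i$ it is easy to show that $\hat{\boldsymbol{\beta}}_{SOLS}$ in (6.83) is

$$
\begin{bmatrix} \hat{\boldsymbol{\beta}}_1 \\ \vdots \\ \hat{\boldsymbol{\beta}}_G \end{bmatrix}
= \begin{bmatrix}
\left[\sum_{i=1}^{N} \mathbf{x}_{i1}\mathbf{x}_{i1}'\right]^{-1} \sum_{i=1}^{N} \mathbf{x}_{i1}y_{i1} \\
\vdots \\
\left[\sum_{i=1}^{N} \mathbf{x}_{iG}\mathbf{x}_{iG}'\right]^{-1} \sum_{i=1}^{N} \mathbf{x}_{iG}y_{iG}
\end{bmatrix},
$$

so that systems OLS is the same as separate equation-by-equation OLS. As might be
expected a priori, if the only link across equations is the error and the errors are treated
as being uncorrelated then joint estimation reduces to single-equation estimation.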

A better estimator is the feasible GLS estimator defined in (6.86) using $\hat{\boldsymbol{\Omega}}$ in (6.88)
and statistical inference based on the asymptotic variance given in (6.87). This estimator
is generally more efficient than systems OLS, though it can be shown to collapse
to OLS if the errors are uncorrelated across equations or if exactly the same regressors
appear in each equation.
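
Both equivalences can be verified numerically. The following sketch assumes NumPy and simulated two-equation data; the helper names `sur_blocks` and `gls` are hypothetical. It builds the block-diagonal $\mathbf{X}_i$ of (6.91), confirms that systems OLS matches equation-by-equation OLS, and confirms that feasible GLS coincides with OLS when both equations share the same regressors.

```python
import numpy as np

def sur_blocks(x_list):
    """Build X_i for every i: G x K arrays whose gth row is [0 ... x_ig' ... 0].

    x_list[g] has shape (N, K_g); the result has shape (N, G, K_1 + ... + K_G).
    """
    N, G = x_list[0].shape[0], len(x_list)
    X = np.zeros((N, G, sum(x.shape[1] for x in x_list)))
    start = 0
    for g, x in enumerate(x_list):
        X[:, g, start:start + x.shape[1]] = x
        start += x.shape[1]
    return X

def gls(X, y, w):
    """[sum_i X_i' W X_i]^{-1} sum_i X_i' W y_i for a common G x G weight W."""
    A = np.einsum("igk,gh,ihl->kl", X, w, X)
    c = np.einsum("igk,gh,ih->k", X, w, y)
    return np.linalg.solve(A, c)

rng = np.random.default_rng(5)                    # hypothetical seed
N = 300
x1 = np.column_stack([np.ones(N), rng.normal(size=N)])
x2 = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
u = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1.0]], size=N)
y = np.column_stack([x1 @ np.array([1.0, 2.0]) + u[:, 0],
                     x2 @ np.array([0.5, -1.0, 1.0]) + u[:, 1]])

# Different regressors in each equation.
X = sur_blocks([x1, x2])
b_sols = gls(X, y, np.eye(2))
b_by_eq = np.concatenate([np.linalg.lstsq(x1, y[:, 0], rcond=None)[0],
                          np.linalg.lstsq(x2, y[:, 1], rcond=None)[0]])
resid = y - X @ b_sols
b_fgls = gls(X, y, np.linalg.inv(resid.T @ resid / N))

# Same regressors in each equation: FGLS collapses to OLS.
X_same = sur_blocks([x2, x2])
b_ols_same = gls(X_same, y, np.eye(2))
r_same = y - X_same @ b_ols_same
b_fgls_same = gls(X_same, y, np.linalg.inv(r_same.T @ r_same / N))
print(b_sols, b_fgls)
```

With different regressors and correlated errors the FGLS and OLS estimates generally differ, which is where the efficiency gain comes from.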

Seemingly unrelated regression models may impose cross-equation parameter
restrictions. For example, a symmetry restriction may imply that the coefficient of
the second regressor in the first equation equals the coefficient of the first regressor
in the second equation. If such restrictions are equality restrictions one can easily
estimate the model by appropriate redefinition of $\mathbf{X}_i$ and $\boldsymbol{\beta}$ given in (6.79). For example,
if there are two equations and the restriction is that $\boldsymbol{\beta}_2 = -\boldsymbol{\beta}_1$ then define
$\mathbf{X}_i = [\mathbf{x}_{i1} \ \ {-\mathbf{x}_{i2}}]'$ and $\boldsymbol{\beta} = \boldsymbol{\beta}_1$. Alternatively, one can estimate using systems
extensions of single-equation OLS and GLS with linear restrictions on the parameters.

Also, in systems of equations it is possible that the variance matrix of the error
vector $\mathbf{u}_i$ is singular, as a result of adding-up constraints. For example, suppose $y_{ig}$
is the $g$th budget share, and the model is $y_{ig} = \alpha_g + \mathbf{z}_i'\boldsymbol{\beta}_g + u_{ig}$, where the same
regressors appear in each equation. Then $\sum_g y_{ig} = 1$ since budget shares sum to one,
which requires $\sum_g \alpha_g = 1$, $\sum_g \boldsymbol{\beta}_g = \mathbf{0}$, and $\sum_g u_{ig} = 0$. The last restriction means
$\boldsymbol{\Omega}_i$ is singular and hence noninvertible. One can eliminate one equation, say the last,
and estimate the model by systems estimation applied to the remaining $G - 1$ equations.
Then the parameter estimates for the $G$th equation can be obtained using the
adding-up constraint. For example, $\hat{\alpha}_G = 1 - (\hat{\alpha}_1 + \cdots + \hat{\alpha}_{G-1})$. It is also possible
to impose equality restrictions on the parameters in this setup. A literature exists on
methods that ensure that estimates obtained are invariant to the equation deleted; see,
for example, Berndt and Savin (1975).
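
The adding-up property is easy to see in a small simulation. This is an illustrative sketch, assuming NumPy and artificially generated budget shares; the data-generating choices and names are hypothetical. With the same regressors (including an intercept) in every equation, the OLS residuals sum to zero across equations for each individual, so the estimated error variance matrix is singular, and the coefficients of a dropped equation can be recovered from the others.

```python
import numpy as np

rng = np.random.default_rng(6)                    # hypothetical seed
N, G = 400, 3
z = rng.normal(size=(N, 2))
Z = np.column_stack([np.ones(N), z])              # same regressors in every equation

# Hypothetical budget shares: positive and summing to one for each individual.
raw = np.exp(0.3 * z @ rng.normal(size=(2, G)) + 0.2 * rng.normal(size=(N, G)))
shares = raw / raw.sum(axis=1, keepdims=True)

# Equation-by-equation OLS; column g of coef holds (alpha_g, beta_g')'.
coef = np.linalg.solve(Z.T @ Z, Z.T @ shares)
resid = shares - Z @ coef
omega = resid.T @ resid / N                       # singular: omega @ ones = 0

# Drop the last equation, then recover its coefficients from adding up.
coef_first = np.linalg.solve(Z.T @ Z, Z.T @ shares[:, :G - 1])
alpha_G = 1.0 - coef_first[0].sum()
beta_G = -coef_first[1:].sum(axis=1)
print(omega @ np.ones(G), alpha_G, beta_G)
```

Here equation-by-equation OLS is used, so the recovered coefficients match those from estimating the dropped equation directly; for other systems estimators invariance to the deleted equation is not automatic, which is the issue addressed in the literature cited above.
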


6.9.4. Panel Data


Another leading application of systems GLS methods is to panel data, where a scalar
dependent variable is observed in each of $T$ time periods for $N$ individuals. Panel data
can be viewed as a system of equations, either $T$ equations for $N$ individuals or $N$
equations for $T$ time periods. In microeconometrics we assume a short panel, with $T$
small and $N \to \infty$ so it is natural to set it up as a scalar dependent variable $y_{it}$, where
the $g$th equation in the preceding discussion is now interpreted as the $t$th time period
and $G = T$.

A simple panel data model is

$$
y_{it} = \mathbf{x}_{it}'\boldsymbol{\beta} + u_{it}, \quad t = 1, \ldots, T, \ i = 1, \ldots, N, \tag{6.92}
$$

a specialization of (6.90) with $\boldsymbol{\beta}$ now constant. Then in (6.79) the regressor matrix
becomes $\mathbf{X}_i = [\mathbf{x}_{i1} \cdots \mathbf{x}_{iT}]'$. After some algebra the systems OLS estimator defined in
(6.83) can be reexpressed as

$$
\hat{\boldsymbol{\beta}}_{POLS} = \left[\sum_{i=1}^{N}\sum_{t=1}^{T} \mathbf{x}_{it}\mathbf{x}_{it}'\right]^{-1} \sum_{i=1}^{N}\sum_{t=1}^{T} \mathbf{x}_{it}y_{it}. \tag{6.93}
$$

This estimator is called the pooled OLS estimator as it pools or combines the
cross-section and time-series aspects of the data.

The pooled estimator is obtained simply by OLS estimation of $y_{it}$ on $\mathbf{x}_{it}$. However,
if $u_{it}$ are correlated over $t$ for given $i$, the default OLS standard errors that assume
independence of the error over both $i$ and $t$ are invalid and can be greatly downward
biased. Instead, statistical inference should be based on the robust form of the
covariance matrix given in (6.84). This is detailed in Section 21.2.3. In practice models
more complicated than (6.92) that include individual specific effects are estimated (see
Section 21.2).
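
The gap between default and panel-robust standard errors can be illustrated with simulated data. The sketch below assumes NumPy and a hypothetical data-generating process in which both the regressor and the error have an individual-specific component, so that errors are correlated over $t$ for given $i$ while remaining uncorrelated with the regressor.

```python
import numpy as np

rng = np.random.default_rng(7)                    # hypothetical seed
N, T = 300, 5
x = rng.normal(size=(N, 1)) + rng.normal(size=(N, T))   # regressor persistent within i
u = rng.normal(size=(N, 1)) + rng.normal(size=(N, T))   # error correlated over t for given i
y = 1.0 + 2.0 * x + u                                   # hypothetical parameter values
X = np.stack([np.ones((N, T)), x], axis=2)              # shape (N, T, 2); slice i is X_i

XtX = np.einsum("itk,itl->kl", X, X)
beta = np.linalg.solve(XtX, np.einsum("itk,it->k", X, y))   # pooled OLS (6.93)
resid = y - X @ beta
bread = np.linalg.inv(XtX)

# Default OLS variance s^2 (X'X)^{-1}, which assumes independence over i and t.
s2 = np.sum(resid ** 2) / (N * T - X.shape[2])
se_default = np.sqrt(np.diag(s2 * bread))

# Panel-robust variance (6.84): sum_i X_i'u_i u_i'X_i in the middle.
score = np.einsum("itk,it->ik", X, resid)
se_robust = np.sqrt(np.diag(bread @ (score.T @ score) @ bread))
print(beta, se_default, se_robust)
```

In this design the default standard errors are noticeably smaller than the panel-robust ones, which is the downward bias described above.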


6.9.5. Systems IV Estimation 


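Estimation of a single linear equation with endogenous regressors was presented
in Section 6.4. Now we extend this to the multivariate linear model (6.79) when
$E[\mathbf{u}_i|\mathbf{X}_i] \neq \mathbf{0}$. Brundy and Jorgenson (1971) considered IV estimation applied to the
system of equations to produce estimates that are both consistent and efficient.

We assume the existence of a $G \times r$ matrix of instruments $\mathbf{Z}_i$ that satisfy $E[\mathbf{u}_i|\mathbf{Z}_i] = \mathbf{0}$
and hence

$$
E[\mathbf{Z}_i'(\mathbf{y}_i - \mathbf{X}_i\boldsymbol{\beta})] = \mathbf{0}. \tag{6.94}
$$

These instruments can be used to obtain consistent parameter estimates using
single-equation IV methods, but joint equation estimation can improve efficiency. The
systems GMM estimator minimizes

$$
Q_N(\boldsymbol{\beta}) = \left[\sum_{i=1}^{N} \mathbf{Z}_i'(\mathbf{y}_i - \mathbf{X}_i\boldsymbol{\beta})\right]' \mathbf{W}_N \left[\sum_{i=1}^{N} \mathbf{Z}_i'(\mathbf{y}_i - \mathbf{X}_i\boldsymbol{\beta})\right], \tag{6.95}
$$

where $\mathbf{W}_N$ is an $r \times r$ weighting matrix. Performing some algebra yields

$$
\hat{\boldsymbol{\beta}}_{SGMM} = \left[\mathbf{X}'\mathbf{Z}\mathbf{W}_N\mathbf{Z}'\mathbf{X}\right]^{-1}\left[\mathbf{X}'\mathbf{Z}\mathbf{W}_N\mathbf{Z}'\mathbf{y}\right], \tag{6.96}
$$

where $\mathbf{X}$ is an $NG \times K$ matrix obtained by stacking $\mathbf{X}_1, \ldots, \mathbf{X}_N$ (see (6.81)) and $\mathbf{Z}$
is an $NG \times r$ matrix obtained by similarly stacking $\mathbf{Z}_1, \ldots, \mathbf{Z}_N$. The systems GMM
estimator has exactly the same form as (6.37), and the asymptotic variance matrix is
that given in (6.39). It follows that a robust estimate of the variance matrix is

$$
\hat{V}[\hat{\boldsymbol{\beta}}_{SGMM}] = N\left[\mathbf{X}'\mathbf{Z}\mathbf{W}_N\mathbf{Z}'\mathbf{X}\right]^{-1}\left[\mathbf{X}'\mathbf{Z}\mathbf{W}_N\hat{\mathbf{S}}\mathbf{W}_N\mathbf{Z}'\mathbf{X}\right]\left[\mathbf{X}'\mathbf{Z}\mathbf{W}_N\mathbf{Z}'\mathbf{X}\right]^{-1}, \tag{6.97}
$$

where, in the systems case and assuming independence over $i$,

$$
\hat{\mathbf{S}} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{Z}_i'\hat{\mathbf{u}}_i\hat{\mathbf{u}}_i'\mathbf{Z}_i. \tag{6.98}
$$

Several choices of weighting matrix receive particular attention.

First, the optimal systems GMM estimator is (6.96) with $\mathbf{W}_N = \hat{\mathbf{S}}^{-1}$, where $\hat{\mathbf{S}}$ is
defined in (6.98). The variance matrix then simplifies to

$$
\hat{V}[\hat{\boldsymbol{\beta}}_{OSGMM}] = N\left[\mathbf{X}'\mathbf{Z}\hat{\mathbf{S}}^{-1}\mathbf{Z}'\mathbf{X}\right]^{-1}.
$$

This estimator is the most efficient GMM estimator based on moment conditions
(6.94). The efficiency gain arises from two factors: (1) systems estimation, which permits
errors in different equations to be correlated, so that $V[\mathbf{u}_i|\mathbf{Z}_i]$ is not restricted to
being block diagonal, and (2) an allowance for quite general heteroskedasticity and
correlation, so that $\boldsymbol{\Omega}_i$ can vary over $i$.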

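Second, the systems 2SLS estimator arises when $\mathbf{W}_N = (N^{-1}\mathbf{Z}'\mathbf{Z})^{-1}$. Consider
the SUR model defined in (6.91), with some of the regressors $\mathbf{x}_{ig}$ now endogenous.
Then systems 2SLS reduces to equation-by-equation 2SLS, with instruments $\mathbf{z}_g$ for
the $g$th equation, if we define the instrument matrix to be

$$
\mathbf{Z}_i = \begin{bmatrix} \mathbf{z}_{i1}' & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \ddots & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{z}_{iG}' \end{bmatrix}. \tag{6.99}
$$

In many applications $\mathbf{z}_1 = \mathbf{z}_2 = \cdots = \mathbf{z}_G$ so that a common set of instruments is used
in all equations, but we need not restrict analysis to this case. For the panel data model
(6.92) systems 2SLS reduces to pooled 2SLS if we define $\mathbf{Z}_i = [\mathbf{z}_{i1} \cdots \mathbf{z}_{iT}]'$.

Third, suppose that $V[\mathbf{u}_i|\mathbf{Z}_i]$ does not vary over $i$, so that $V[\mathbf{u}_i|\mathbf{Z}_i] = \boldsymbol{\Omega}$. This is a
systems analogue of the single-equation assumption of homoskedasticity. Then as with
(6.88) a consistent estimate of $\boldsymbol{\Omega}$ is $\hat{\boldsymbol{\Omega}} = N^{-1}\sum_i \hat{\mathbf{u}}_i\hat{\mathbf{u}}_i'$, where $\hat{\mathbf{u}}_i$ are residuals based
on a consistent IV estimator such as systems 2SLS. Then the optimal GMM estimator
is (6.96) with $\mathbf{W}_N = \mathbf{I}_N \otimes \hat{\boldsymbol{\Omega}}$, to be read as $\mathbf{W}_N = [N^{-1}\mathbf{Z}'(\mathbf{I}_N \otimes \hat{\boldsymbol{\Omega}})\mathbf{Z}]^{-1}$ since $\mathbf{W}_N$ is
$r \times r$; this is $\hat{\mathbf{S}}^{-1}$ from (6.98) with $\hat{\mathbf{u}}_i\hat{\mathbf{u}}_i'$ replaced by $\hat{\boldsymbol{\Omega}}$. This estimator should be contrasted with the three-stage
least-squares estimator presented at the end of the next section.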
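
The weighting-matrix choices can be compared in a short simulation. The sketch below assumes NumPy and a hypothetical panel-type design with one endogenous regressor and two excluded instruments per period; function and variable names are illustrative. It computes systems 2SLS and the optimal systems GMM estimator from (6.96), with $\hat{\mathbf{S}}$ from (6.98) based on 2SLS residuals.

```python
import numpy as np

def systems_gmm(X, Z, y, W):
    """Systems GMM estimator (6.96) for stacked X (NG x K), Z (NG x r), y (NG,)."""
    ZX, Zy = Z.T @ X, Z.T @ y
    return np.linalg.solve(ZX.T @ W @ ZX, ZX.T @ W @ Zy)

def s_hat(Zi, u):
    """S_hat in (6.98): N^{-1} sum_i Z_i'u_i u_i'Z_i, with Zi (N, G, r) and u (N, G)."""
    m = np.einsum("igr,ig->ir", Zi, u)
    return m.T @ m / Zi.shape[0]

rng = np.random.default_rng(8)                    # hypothetical seed
N, G = 1000, 3
z = rng.normal(size=(N, G, 2))
e = rng.normal(size=(N, G))
u = e + rng.normal(size=(N, 1))                   # errors correlated across equations
x = z @ np.array([1.0, 0.5]) + 0.8 * e + rng.normal(size=(N, G))   # endogenous regressor
y = 1.0 + 2.0 * x + u                             # hypothetical parameter values

ones = np.ones((N, G, 1))
Xi = np.concatenate([ones, x[:, :, None]], axis=2)    # X_i, shape (N, G, 2)
Zi = np.concatenate([ones, z], axis=2)                # Z_i, shape (N, G, 3)
X, Z, ys = Xi.reshape(N * G, -1), Zi.reshape(N * G, -1), y.reshape(N * G)

# Systems 2SLS: W_N = (N^{-1} Z'Z)^{-1}.
b_2sls = systems_gmm(X, Z, ys, np.linalg.inv(Z.T @ Z / N))
# Optimal systems GMM: W_N = S_hat^{-1}, using 2SLS residuals.
S = s_hat(Zi, y - Xi @ b_2sls)
b_opt = systems_gmm(X, Z, ys, np.linalg.inv(S))
ZX = Z.T @ X
V_opt = N * np.linalg.inv(ZX.T @ np.linalg.inv(S) @ ZX)
b_ols = np.linalg.lstsq(X, ys, rcond=None)[0]     # inconsistent here, for comparison
print(b_ols, b_2sls, b_opt, np.sqrt(np.diag(V_opt)))
```

With $\mathbf{Z}_i$ defined by stacking the period-specific instrument vectors, the systems 2SLS estimate equals pooled 2SLS on the stacked rows, while OLS is visibly biased because the regressor is correlated with the error.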


6.9.6. Linear Simultaneous Equations Systems


The linear simultaneous equations model, introduced in Section 2.4, is a very important
model that is often presented in considerable length in introductory graduate-level
econometrics courses. In this section we provide a very brief self-contained summary.
The discussion of identification overlaps with that in Chapter 2. Due to the presence
of endogenous variables OLS and SUR estimators are inconsistent. Consistent estimation
methods are placed in the context of GMM estimation, even though the standard
methods were developed well before GMM.

The linear simultaneous equations model specifies the $g$th of $G$ equations for the
$i$th of $N$ individuals to be given by

$$
y_{ig} = \mathbf{z}_{ig}'\boldsymbol{\gamma}_g + \mathbf{Y}_{ig}'\boldsymbol{\beta}_g + u_{ig}, \quad g = 1, \ldots, G, \tag{6.100}
$$

where the order of subscripts is that of Section 6.9 rather than Section 2.4, $\mathbf{z}_g$ is
a vector of exogenous regressors that are assumed to be uncorrelated with the error
term $u_g$ and $\mathbf{Y}_g$ is a vector that contains a subset of the dependent variables
$y_1, \ldots, y_{g-1}, y_{g+1}, \ldots, y_G$ of the other $G - 1$ equations. $\mathbf{Y}_g$ is endogenous as it is
correlated with model errors. The model for the $i$th individual can equivalently be
written as

$$
\mathbf{y}_i'\mathbf{B} + \mathbf{z}_i'\boldsymbol{\Gamma} = \mathbf{u}_i', \tag{6.101}
$$

where $\mathbf{y}_i = [y_{i1} \ldots y_{iG}]'$ is a $G \times 1$ vector of endogenous variables, $\mathbf{z}_i$ is an $r \times 1$
vector of exogenous variables that is the union of $\mathbf{z}_{i1}, \ldots, \mathbf{z}_{iG}$, $\mathbf{u}_i = [u_{i1} \ldots u_{iG}]'$ is
a $G \times 1$ error vector, $\mathbf{B}$ is a $G \times G$ parameter matrix with diagonal entries unity, $\boldsymbol{\Gamma}$ is
an $r \times G$ parameter matrix, and some of the entries in $\mathbf{B}$ and $\boldsymbol{\Gamma}$ are constrained to be
zero. It is assumed that $\mathbf{u}_i$ is iid over $i$ with mean $\mathbf{0}$ and variance matrix $\boldsymbol{\Sigma}$.

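The model (6.101) is called the structural form with different restrictions on $\mathbf{B}$
and $\boldsymbol{\Gamma}$ corresponding to different structures. Solving for the endogenous variables as a
function of the exogenous variables yields the reduced form

$$
\begin{aligned}
\mathbf{y}_i' &= -\mathbf{z}_i'\boldsymbol{\Gamma}\mathbf{B}^{-1} + \mathbf{u}_i'\mathbf{B}^{-1} \\
&= \mathbf{z}_i'\boldsymbol{\Pi} + \mathbf{v}_i',
\end{aligned} \tag{6.102}
$$

where $\boldsymbol{\Pi} = -\boldsymbol{\Gamma}\mathbf{B}^{-1}$ is the $r \times G$ matrix of reduced form parameters and $\mathbf{v}_i' = \mathbf{u}_i'\mathbf{B}^{-1}$
is the reduced form error vector with variance $\boldsymbol{\Omega} = (\mathbf{B}^{-1})'\boldsymbol{\Sigma}\mathbf{B}^{-1}$.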
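
The mapping from structural to reduced form can be illustrated numerically. The sketch below assumes NumPy and hypothetical values for $\mathbf{B}$, $\boldsymbol{\Gamma}$, and $\boldsymbol{\Sigma}$ in a two-equation system; it generates data from (6.102), checks that the data satisfy the structural form (6.101), and estimates the reduced form by OLS.

```python
import numpy as np

rng = np.random.default_rng(9)                    # hypothetical seed
N, G, r = 20000, 2, 3
# Hypothetical structural parameters in y_i'B + z_i'Gamma = u_i' (diagonal of B is unity).
B = np.array([[1.0, -0.5],
              [0.4, 1.0]])
Gamma = np.array([[-1.0, -2.0],
                  [-0.8, 0.0],                    # zero entries are exclusion restrictions
                  [0.0, -0.6]])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])

z = np.column_stack([np.ones(N), rng.normal(size=(N, r - 1))])
u = rng.multivariate_normal(np.zeros(G), Sigma, size=N)
B_inv = np.linalg.inv(B)
Pi = -Gamma @ B_inv                               # reduced form coefficients
Omega = B_inv.T @ Sigma @ B_inv                   # reduced form error variance
y = z @ Pi + u @ B_inv                            # reduced form (6.102), one row per i

Pi_hat = np.linalg.solve(z.T @ z, z.T @ y)        # OLS of each y_g on z
v_hat = y - z @ Pi_hat
Omega_hat = v_hat.T @ v_hat / N
print(Pi, Pi_hat)
```

OLS recovers $\boldsymbol{\Pi}$ and $\boldsymbol{\Omega}$ closely in this large sample; whether $\mathbf{B}$, $\boldsymbol{\Gamma}$, and $\boldsymbol{\Sigma}$ can then be recovered is the identification question.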

The reduced form can be consistently estimated by OLS, yielding estimates of
$\boldsymbol{\Pi} = -\boldsymbol{\Gamma}\mathbf{B}^{-1}$ and $\boldsymbol{\Omega} = (\mathbf{B}^{-1})'\boldsymbol{\Sigma}\mathbf{B}^{-1}$. The problem of identification, see Section 2.5,
is one of whether these lead to unique estimates of the structural form parameters $\mathbf{B}$,
$\boldsymbol{\Gamma}$, and $\boldsymbol{\Sigma}$. This requires some parameter restrictions since without restrictions $\mathbf{B}$, $\boldsymbol{\Gamma}$,
and $\boldsymbol{\Sigma}$ contain $G^2$ more parameters than $\boldsymbol{\Pi}$ and $\boldsymbol{\Omega}$. A necessary condition for
identification of parameters in the $g$th equation is the order condition that the number of
exogenous variables excluded from the $g$th equation must be at least equal to the number
of endogenous variables included. This is the same as the order condition given
in Section 6.4.1. For example, if $\mathbf{Y}_{ig}$ in (6.100) has one component, so there is one
endogenous variable in the equation, then at least one of the components of $\mathbf{z}_i$ must
not be included. This will ensure that there are as many instruments as regressors.

A sufficient condition for identification is the stronger rank condition. This is given
in many books such as Greene's (2003) and for brevity is not given here. Other
restrictions, such as covariance restrictions, may also lead to identification.

Given identification, the structural model parameters can be consistently estimated
by separate estimation of each equation by two-stage least squares defined in (6.44).
The same set of instruments $\mathbf{z}_i$ is used for each equation. In the $g$th equation the
subcomponent $\mathbf{z}_{ig}$ is used as instrument for itself and the remainder of $\mathbf{z}_i$ is used as
instrument for $\mathbf{Y}_{ig}$.

More efficient systems estimates are obtained using the three-stage least-squares
(3SLS) estimator of Zellner and Theil (1962), which assumes errors are homoskedastic
but are correlated across equations. First, estimate the reduced form coefficients $\boldsymbol{\Pi}$
in (6.102) by OLS regression of $\mathbf{y}$ on $\mathbf{z}$. Second, obtain the 2SLS estimates by OLS
regression of (6.100), where $\mathbf{Y}_g$ is replaced by the reduced form predictions $\hat{\mathbf{Y}}_g = \mathbf{z}'\hat{\boldsymbol{\Pi}}_g$.
This is OLS regression of $y_g$ on $\hat{\mathbf{Y}}_g$ and $\mathbf{z}_g$, or equivalently of $y_g$ on $\hat{\mathbf{x}}_g$, where $\hat{\mathbf{x}}_g$ are the
predictions of $\mathbf{Y}_g$ and $\mathbf{z}_g$ from OLS regression on $\mathbf{z}$. Third, obtain the 3SLS estimates
by systems FGLS regression of $y_g$ on $\hat{\mathbf{x}}_g$, $g = 1, \ldots, G$. Then from (6.89)

$$
\hat{\boldsymbol{\theta}}_{3SLS} = \left[\hat{\mathbf{X}}'(\hat{\boldsymbol{\Omega}}^{-1} \otimes \mathbf{I}_N)\hat{\mathbf{X}}\right]^{-1} \hat{\mathbf{X}}'(\hat{\boldsymbol{\Omega}}^{-1} \otimes \mathbf{I}_N)\mathbf{y},
$$

where $\hat{\mathbf{X}}$ is obtained by first forming a block-diagonal matrix $\hat{\mathbf{X}}_i$ with diagonal blocks
$\hat{\mathbf{x}}_{i1}', \ldots, \hat{\mathbf{x}}_{iG}'$ and then stacking $\hat{\mathbf{X}}_1, \ldots, \hat{\mathbf{X}}_N$, and $\hat{\boldsymbol{\Omega}} = N^{-1}\sum_i \hat{\mathbf{u}}_i\hat{\mathbf{u}}_i'$ with $\hat{\mathbf{u}}_i$ the residual
vectors calculated using the 2SLS estimates.

This estimator coincides with the systems GMM estimator with $\mathbf{W}_N = \mathbf{I}_N \otimes \hat{\boldsymbol{\Omega}}$ in
the case that the systems GMM estimator uses the same instruments in every equation.
Otherwise, 3SLS and systems GMM differ, though both yield consistent estimates if
$E[\mathbf{u}_i|\mathbf{z}_i] = \mathbf{0}$.
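
The three stages can be sketched in code. The example below is an illustration under stated assumptions: NumPy, a hypothetical two-equation simultaneous system with simulated data, and helper names (`tsls`, `three_sls`) that are not from the text. The third stage is written block by block, using the fact that block $(g, h)$ of $\hat{\mathbf{X}}'(\hat{\boldsymbol{\Omega}}^{-1} \otimes \mathbf{I}_N)\hat{\mathbf{X}}$ equals the $(g, h)$ element of $\hat{\boldsymbol{\Omega}}^{-1}$ times $\hat{\mathbf{x}}_g'\hat{\mathbf{x}}_h$ when rows are ordered equation by equation.

```python
import numpy as np

def tsls(x, y, z):
    """2SLS: OLS of y on x_hat, the projection of x on the instruments z."""
    x_hat = z @ np.linalg.solve(z.T @ z, z.T @ x)
    return np.linalg.solve(x_hat.T @ x, x_hat.T @ y), x_hat

def three_sls(x_list, y, z):
    """3SLS for G equations; x_list[g] holds the right-hand-side variables of
    equation g (shape (N, K_g)), y is (N, G), z is the common instrument matrix."""
    N, G = y.shape
    fits = [tsls(x_list[g], y[:, g], z) for g in range(G)]
    b_2sls = [f[0] for f in fits]
    x_hat = [f[1] for f in fits]
    u = np.column_stack([y[:, g] - x_list[g] @ b_2sls[g] for g in range(G)])
    w = np.linalg.inv(u.T @ u / N)                # Omega_hat^{-1} from 2SLS residuals
    A = np.block([[w[g, h] * x_hat[g].T @ x_hat[h] for h in range(G)]
                  for g in range(G)])
    c = np.concatenate([sum(w[g, h] * x_hat[g].T @ y[:, h] for h in range(G))
                        for g in range(G)])
    return np.linalg.solve(A, c), np.concatenate(b_2sls)

rng = np.random.default_rng(10)                   # hypothetical seed
N = 5000
z = np.column_stack([np.ones(N), rng.normal(size=(N, 3))])
A_end = np.array([[0.0, 0.5], [-0.3, 0.0]])       # y1 depends on y2, y2 on y1
C = np.array([[1.0, 1.0, 0.0, 0.0], [0.5, 0.0, 1.0, -1.0]])
u = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=N)
y = (z @ C.T + u) @ np.linalg.inv(np.eye(2) - A_end).T

x_list = [np.column_stack([y[:, 1], z[:, :2]]),          # eq 1: y2, constant, z1
          np.column_stack([y[:, 0], z[:, [0, 2, 3]]])]   # eq 2: y1, constant, z2, z3
b_3sls, b_2sls = three_sls(x_list, y, z)
truth = np.array([0.5, 1.0, 1.0, -0.3, 0.5, 1.0, -1.0])
print(b_2sls, b_3sls)
```

In this design both estimators are close to the hypothetical true values; 3SLS differs from 2SLS because the errors are correlated across equations and the first equation is overidentified.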


6.9.7. Linear Systems ML Estimation 


The systems estimators for the linear model are essentially LS or IV estimators with
inference based on robust standard errors. Now additionally assume normally distributed
iid errors, so that $\mathbf{u}_i \sim \mathcal{N}[\mathbf{0}, \boldsymbol{\Omega}]$.

For systems with exogenous regressors the resulting MLE is asymptotically equivalent
to the GLS estimator. These estimators do use different estimators of $\boldsymbol{\Omega}$ and hence
$\boldsymbol{\beta}$, however, so that there are small-sample differences between the MLE and the GLS
estimator. For example, see Chapter 21 for the random effects panel data model.

For the linear SEM (6.101), the limited information maximum likelihood es- 
timator, a single-equation ML estimator, is asymptotically equivalent to 2SLS. The 
full information maximum likelihood estimator, the systems MLE, is asymptotically 
equivalent to 3SLS. See, for example, Schmidt (1976) and Greene (2003). 


6.10. Nonlinear Sets of Equations 


We now consider systems of equations that are nonlinear in parameters. For example,
demand equation systems obtained from a specified direct or indirect utility may be
nonlinear in parameters. More generally, if a nonlinear model is appropriate for a
dependent variable studied in isolation, for example a logit or Poisson model, then any
joint model for two or more such variables will necessarily be nonlinear.

We begin with a discussion of fully parametric joint modeling, before focusing on 
partially parametric modeling. As in the linear case we present models with exogenous 
regressors before considering the complication of endogenous regressors. 


6.10.1. Nonlinear Systems ML Estimation 


Maximum likelihood estimation for a single dependent variable was presented in
Section 5.6. These results can be immediately applied to joint models of several dependent
variables, with the very minor change that the single dependent variable conditional
density $f(y_i|\mathbf{x}_i, \boldsymbol{\theta})$ becomes $f(\mathbf{y}_i|\mathbf{X}_i, \boldsymbol{\theta})$, where $\mathbf{y}_i$ denotes the vector of dependent
variables, $\mathbf{X}_i$ denotes all the regressors, and $\boldsymbol{\theta}$ denotes all the parameters.

For example, if $y_1 \sim \mathcal{N}[\exp(\mathbf{x}_1'\boldsymbol{\beta}_1), \sigma_1^2]$ and $y_2 \sim \mathcal{N}[\exp(\mathbf{x}_2'\boldsymbol{\beta}_2), \sigma_2^2]$ then a suitable
joint model may be to assume that $(y_1, y_2)$ are bivariate normal with means $\exp(\mathbf{x}_1'\boldsymbol{\beta}_1)$
and $\exp(\mathbf{x}_2'\boldsymbol{\beta}_2)$, variances $\sigma_1^2$ and $\sigma_2^2$, and correlation $\rho$.

For data that are not normally distributed there can be challenges in specifying and 
selecting a sufficiently flexible joint distribution. For example, for univariate counts 
a standard starting model is the negative binomial (see Chapter 20). However, in ex- 
tending this to a bivariate or multivariate model for counts there are several alternative 
bivariate negative binomial models to choose from. These might differ, for example, 
as to whether the univariate conditional distribution or the univariate marginal distri- 
bution is negative binomial. In contrast the multivariate normal distribution has condi- 
tional and marginal distributions that are both normal. All of these multivariate nega- 
tive binomial distributions place some restrictions on the range of correlation such as 
restricting to positive correlation, whereas for the multivariate normal there is no such 
restriction. 

Fortunately, modern computational advances permit richer models to be specified.
For example, a reasonably flexible model for correlated bivariate counts is to assume
that, conditional on unobservables $\varepsilon_1$ and $\varepsilon_2$, $y_1$ is Poisson with mean $\exp(\mathbf{x}_1'\boldsymbol{\beta}_1 + \varepsilon_1)$
and $y_2$ is Poisson with mean $\exp(\mathbf{x}_2'\boldsymbol{\beta}_2 + \varepsilon_2)$. An estimable bivariate distribution can
be obtained by assuming that the unobservables $\varepsilon_1$ and $\varepsilon_2$ are bivariate normal and
integrating them out. There is no closed-form solution for this bivariate distribution, but
the parameters can nonetheless be estimated using the method of maximum simulated
likelihood presented in Section 12.4.
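
The idea of integrating out the unobservables by simulation can be sketched for a single observation. This is only an illustration, assuming NumPy, hypothetical values for the index functions and the covariance of $(\varepsilon_1, \varepsilon_2)$, and a plain Monte Carlo average; it is not the estimator of Section 12.4 itself. The joint probability of $(y_1, y_2)$ is approximated by averaging the product of the two conditional Poisson probabilities over draws of the unobservables.

```python
import numpy as np

def log_factorial(k):
    """ln(k!) computed as a sum of logs (equals 0 for k = 0)."""
    return np.sum(np.log(np.arange(1, k + 1)))

def simulated_joint_pmf(y1, y2, m1, m2, cov, draws):
    """Simulate P(y1, y2) when y_j | eps_j is Poisson with mean exp(m_j + eps_j).

    draws is an (S, 2) array of standard normal draws; they are transformed to
    have covariance cov, and the product of Poisson pmfs is averaged over draws.
    """
    eps = draws @ np.linalg.cholesky(cov).T
    mu1, mu2 = np.exp(m1 + eps[:, 0]), np.exp(m2 + eps[:, 1])
    logp = (y1 * np.log(mu1) - mu1 - log_factorial(y1)
            + y2 * np.log(mu2) - mu2 - log_factorial(y2))
    return np.mean(np.exp(logp))

rng = np.random.default_rng(11)                   # hypothetical seed
draws = rng.standard_normal((2000, 2))            # same draws reused for every evaluation
cov = np.array([[0.25, 0.15], [0.15, 0.25]])      # hypothetical covariance of (eps_1, eps_2)
support = np.arange(40)
table = np.array([[simulated_joint_pmf(a, b, 0.5, 0.2, cov, draws) for b in support]
                  for a in support])

mean1 = (table.sum(axis=1) * support).sum()
mean2 = (table.sum(axis=0) * support).sum()
cov12 = (table * np.outer(support, support)).sum() - mean1 * mean2
print(table.sum(), mean1, mean2, cov12)
```

The simulated probabilities sum to one over the (truncated) support, and positively correlated unobservables induce positive covariance between the two counts. In estimation the log of such simulated probabilities would be summed over observations and maximized over the parameters.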

A number of examples of nonlinear joint models are given throughout Part 4 of the 
book. The simplest joint models can be inflexible, so consistency can rely on distribu- 
tional assumptions that are too restrictive. However, there is generally no theoretical 
impediment to specifying more flexible models that can be estimated using computa- 
tionally intensive methods. 

In particular, two leading methods for generating rich multivariate parametric mod- 
els are presented in detail in Section 19.3. These methods are given in the context of 
duration data models, but they have much wider applicability. First, one can introduce 
correlated unobserved heterogeneity, as in the bivariate count example just given.
Second, one can use copulas, which provide a way to generate a joint distribution
given specified univariate marginals.

For ML estimation a simpler though less efficient quasi-ML approach is to specify
separate parametric models for $y_1$ and $y_2$ and obtain ML estimates assuming independence
of $y_1$ and $y_2$ but then do statistical inference permitting $y_1$ and $y_2$ to be
correlated. This has been presented in Section 5.7.5. In the remainder of this section
we consider such partially parametric approaches.

The challenges become greater if there is endogeneity, so that a dependent variable
in one equation appears as a regressor in another equation. Few models for nonlinear
simultaneous equations exist, aside from nonlinear regression models with additive
errors that are normally distributed.


6.10.2. Nonlinear Systems of Equations 


For linear regression the movement from single equation to multiple equations is clear
as the starting point is the linear model $y = \mathbf{x}'\boldsymbol{\beta} + u$ and estimation is by least squares.
Efficient systems estimation is then by systems GLS estimation. For nonlinear models
there can be much more variety in the starting point and estimation method.

We define the multivariate nonlinear model with $G$ dependent variables to be

$$
\mathbf{r}(\mathbf{y}_i, \mathbf{X}_i, \boldsymbol{\beta}) = \mathbf{u}_i, \tag{6.103}
$$

where $\mathbf{y}_i$ and $\mathbf{u}_i$ are $G \times 1$ vectors, $\mathbf{r}(\mathbf{y}_i, \mathbf{X}_i, \boldsymbol{\beta})$ is a $G \times 1$ vector function, $\mathbf{X}_i$ is a
$G \times L$ matrix, and $\boldsymbol{\beta}$ is a $K \times 1$ column vector. Throughout this section we make the
cross-section assumption that the error vector $\mathbf{u}_i$ is independent over $i$, but components
of $\mathbf{u}_i$ for given $i$ may be correlated with variances and covariances that vary over $i$.

One example of (6.103) is a nonlinear seemingly unrelated regression model. 
Then the gth of G equations for the ith of N individuals is given by 


$$r_g(y_{ig}, x_{ig}, \beta_g) = u_{ig}, \qquad g = 1, \ldots, G. \tag{6.104}$$


For example, $u_{ig} = y_{ig} - \exp(x_{ig}'\beta_g)$. Then $u_i$ and $r(\cdot)$ in (6.103) are $G \times 1$ vectors 
with gth entries $u_{ig}$ and $r_g(\cdot)$, $X_i$ is the same block-diagonal matrix as that defined in 
(6.91), and $\beta$ is obtained by stacking $\beta_1$ to $\beta_G$. 

A second example is a nonlinear panel data model. Then for individual i in 
period t 


$$r(y_{it}, x_{it}, \beta) = u_{it}, \qquad t = 1, \ldots, T. \tag{6.105}$$


Then $u_i$ and $r(\cdot)$ in (6.103) are $T \times 1$ vectors, so $G = T$, with tth entries $u_{it}$ and 
$r(y_{it}, x_{it}, \beta)$. The panel model differs from the SUR model by having the same 
function $r(\cdot)$ and parameters $\beta$ in each period. 


6.10.3. Nonlinear Systems Estimation 

When the regressors $X_i$ in the model (6.103) are exogenous, 


$$E[u_i \mid X_i] = 0, \tag{6.106}$$


where $u_i$ is the error term defined in (6.103). We assume that the error term is 
independent over $i$, and the variance matrix is 


$$\Omega_i = E[u_i u_i' \mid X_i]. \tag{6.107}$$


Additive Errors 


Systems estimation is a straightforward adaptation of systems OLS and FGLS 
estimation of the linear models when the nonlinear model is additive in the error term, 
so that (6.103) specializes to 


$$u_i = y_i - g(X_i, \beta). \tag{6.108}$$


Then the systems NLS estimator minimizes the sum of squared residuals $\sum_i u_i'u_i$, 
whereas the systems FGNLS estimator minimizes 


$$Q_N(\beta) = \sum_i u_i'\hat\Omega_i^{-1}u_i, \tag{6.109}$$


where we specify a model $\Omega_i(\gamma)$ for $\Omega_i$ and $\hat\Omega_i = \Omega_i(\hat\gamma)$. To guard against possible 
misspecification of $\Omega_i$ one can use robust standard errors that essentially require only 
that $u_i$ is independent and satisfies (6.106). Then the estimated variance of the systems 
FGNLS estimator is the same as that for the linear systems FGLS estimator in (6.87), 
with $X_i$ replaced by $\partial g(X_i, \beta)/\partial\beta'|_{\hat\beta}$ and now $\hat u_i = y_i - g(X_i, \hat\beta)$. The estimated 
variance of the simpler systems NLS estimator is obtained by additionally replacing $\hat\Omega_i$ 
by $I_G$. 

The main challenge can be specifying a useful model for $\Omega_i$. As an example, 
suppose we wish to jointly model two count data variables. In Chapter 20 we show 
that a standard model for counts, a little more general than the Poisson model, 
specifies the conditional mean to be $\exp(x'\beta)$ and the conditional variance to be a 
multiple of $\exp(x'\beta)$. Then a joint model might specify $u = [u_1 \; u_2]'$, where 
$u_1 = y_1 - \exp(x_1'\beta_1)$ and $u_2 = y_2 - \exp(x_2'\beta_2)$. The variance matrix $\Omega_i$ then has diagonal 
entries $\alpha_1\exp(x_{i1}'\beta_1)$ and $\alpha_2\exp(x_{i2}'\beta_2)$, and one possible parameterization for the 
covariance is $\alpha_3[\exp(x_{i1}'\beta_1)\exp(x_{i2}'\beta_2)]^{1/2}$. The estimate $\hat\Omega_i$ then requires estimates of 
$\beta_1$, $\beta_2$, $\alpha_1$, $\alpha_2$, and $\alpha_3$ that may be obtained from first-step single-equation estimation. 
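
The two-count example can be made concrete with a short sketch. The code below is an illustration only, not a full estimator: it builds the 2 x 2 working variance matrix for each observation at assumed first-step estimates and evaluates the systems FGNLS objective (6.109) at a candidate parameter value. The function names (`cond_mean`, `omega_hat`, `fgnls_objective`), the data, and all parameter values are hypothetical, and the minimization over the parameters is left to whatever optimizer is at hand.

```python
import math

def cond_mean(x, beta):
    """Conditional mean exp(x'beta) for one equation."""
    return math.exp(sum(xj * bj for xj, bj in zip(x, beta)))

def omega_hat(x1, x2, beta1, beta2, alpha1, alpha2, alpha3):
    """2 x 2 working variance matrix for one observation.

    Diagonal entries are alpha_g * exp(x_g'beta_g); the covariance is
    alpha3 * sqrt(exp(x_1'beta_1) * exp(x_2'beta_2)).
    """
    mu1, mu2 = cond_mean(x1, beta1), cond_mean(x2, beta2)
    cov = alpha3 * math.sqrt(mu1 * mu2)
    return [[alpha1 * mu1, cov], [cov, alpha2 * mu2]]

def fgnls_objective(data, beta1, beta2, omegas):
    """Sum over i of u_i' Omega_i^{-1} u_i, with each Omega_i held fixed."""
    total = 0.0
    for (y1, y2, x1, x2), om in zip(data, omegas):
        u1 = y1 - cond_mean(x1, beta1)
        u2 = y2 - cond_mean(x2, beta2)
        det = om[0][0] * om[1][1] - om[0][1] * om[1][0]
        if det <= 0.0:
            raise ValueError("working variance matrix is not positive definite")
        # Quadratic form using the closed-form inverse of a symmetric 2 x 2 matrix.
        total += (om[1][1] * u1 * u1 - 2.0 * om[0][1] * u1 * u2
                  + om[0][0] * u2 * u2) / det
    return total

# Hypothetical data: (y1, y2, x1, x2) with an intercept and one regressor each.
data = [
    (2, 1, [1.0, 0.5], [1.0, -0.2]),
    (0, 3, [1.0, -1.0], [1.0, 0.7]),
    (4, 2, [1.0, 1.2], [1.0, 0.1]),
]
b1_first, b2_first = [0.3, 0.4], [0.5, 0.2]   # assumed first-step estimates
a1, a2, a3 = 1.2, 1.5, 0.3                    # assumed variance parameters

omegas = [omega_hat(x1, x2, b1_first, b2_first, a1, a2, a3)
          for (_, _, x1, x2) in data]
q_first = fgnls_objective(data, b1_first, b2_first, omegas)
q_other = fgnls_objective(data, [0.2, 0.5], [0.4, 0.3], omegas)
print(q_first, q_other)
```

Holding the working variance matrices fixed at first-step values while the residuals vary with the candidate parameters mirrors the two-step logic of feasible GLS. Setting the covariance parameter to zero reduces the objective to a sum of separately weighted squared residuals.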


Nonadditive Errors 


With nonadditive errors least-squares regression is no longer appropriate, as shown 
in the single-equation case in Section 6.2.2. Wooldridge (2002) presents consistent 
method of moments estimation. 

The conditional moment restriction (6.106) leads to many possible unconditional 
moment conditions that can be used for estimation. The obvious starting point is to 
base estimation on the moment conditions $E[X_i'u_i] = 0$. However, other moment 
conditions may be used. We more generally consider estimation based on $K$ moment 
conditions 


$$E[R(X_i, \beta)'u_i] = 0, \tag{6.110}$$

where $R(X_i, \beta)$ is a $K \times G$ matrix of functions of $X_i$ and $\beta$. The specification of 
$R(X_i, \beta)$ and possible dependence on $\beta$ are discussed in the following. 

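By construction there are as many moment conditions as parameters. The systems 
method of moments estimator $\hat\beta_{SMM}$ solves the corresponding sample moment 
conditions 


$$\frac{1}{N}\sum_{i=1}^{N} R(X_i, \beta)'r(y_i, X_i, \hat\beta_{SMM}) = 0, \tag{6.111}$$


where in practice $R(X_i, \beta)$ is evaluated at a first-step estimate $\tilde\beta$. This estimator is 
asymptotically normal with variance matrix 


$$\hat V[\hat\beta_{SMM}] = \left[\sum_{i=1}^{N}\hat R_i'\hat D_i\right]^{-1}\sum_{i=1}^{N}\hat R_i'\hat u_i\hat u_i'\hat R_i\left[\sum_{i=1}^{N}\hat D_i'\hat R_i\right]^{-1}, \tag{6.112}$$


where $\hat D_i = \partial r_i/\partial\beta'|_{\hat\beta}$, $\hat R_i = R(X_i, \tilde\beta)$, and $\hat u_i = r(y_i, X_i, \hat\beta_{SMM})$. 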
The main issue is specification of $R(X_i, \beta)$ in (6.110). From Section 6.3.7, the most 
efficient estimator based on (6.106) specifies 


$$R^*(X_i, \beta) = E\left[\left.\frac{\partial r(y_i, X_i, \beta)'}{\partial\beta}\,\right|\,X_i\right]\Omega_i^{-1}. \tag{6.113}$$

In general the first expectation on the right-hand side requires strong distributional 
assumptions, making optimal estimation difficult. 

Simplification does occur, however, if the nonlinear model is one with additive 
error defined in (6.108). Then $R^*(X_i, \beta) = \partial g(X_i, \beta)'/\partial\beta \times \Omega_i^{-1}$, and the estimating 
equations (6.110) become 


$$N^{-1}\sum_{i=1}^{N}\frac{\partial g(X_i, \beta)'}{\partial\beta}\,\hat\Omega_i^{-1}\left(y_i - g(X_i, \hat\beta_{SMM})\right) = 0.$$

This estimator is asymptotically equivalent to the systems FGNLS estimator that 
minimizes (6.109). 


6.10.4. Nonlinear Systems IV Estimation 


When the regressors $X_i$ in the model (6.103) are endogenous, so that $E[u_i \mid X_i] \neq 0$, we 
assume the existence of a $G \times r$ matrix of instruments $Z_i$ such that 


$$E[u_i \mid Z_i] = 0, \tag{6.114}$$


where $u_i$ is the error term defined in (6.103). We assume that the error term is 
independent over $i$, and the variance matrix is $\Omega_i = E[u_i u_i' \mid Z_i]$. For the nonlinear SUR model 
$Z_i$ is as defined in (6.99). 

The approach is similar to that used in the preceding section for the systems MM 
estimator, with the additional complication that now there may be a surplus of 
instruments leading to a need for GMM estimation rather than just MM estimation. 
Conditional moment restriction (6.114) leads to many possible unconditional moment 
conditions that can be used for estimation. Here we follow many others in basing estimation 
on the moment conditions $E[Z_i'u_i] = 0$. Then a systems GMM estimator minimizes 


$$Q_N(\beta) = \left[\sum_{i=1}^{N} Z_i'r(y_i, X_i, \beta)\right]' W_N \left[\sum_{i=1}^{N} Z_i'r(y_i, X_i, \beta)\right]. \tag{6.115}$$


This estimator is asymptotically normal with estimated variance 

$$\hat V[\hat\beta_{SGMM}] = N\left[\hat D'ZW_NZ'\hat D\right]^{-1}\left[\hat D'ZW_N\hat SW_NZ'\hat D\right]\left[\hat D'ZW_NZ'\hat D\right]^{-1}, \tag{6.116}$$


where $\hat D'Z = \sum_i \partial r_i'/\partial\beta|_{\hat\beta}\,Z_i$ and $\hat S = N^{-1}\sum_i Z_i'\hat u_i\hat u_i'Z_i$ and we assume $u_i$ is 
independent over $i$ with variance matrix $V[u_i \mid X_i] = \Omega_i$. 

The choice $W_N = [N^{-1}\sum_i Z_i'Z_i]^{-1}$ corresponds to NL2SLS in the case 
that $r(y_i, X_i, \beta)$ is obtained from a nonlinear SUR model. The choice 
$W_N = [N^{-1}\sum_i Z_i'\hat\Omega Z_i]^{-1}$, where $\hat\Omega = N^{-1}\sum_i \hat u_i\hat u_i'$, is called nonlinear 3SLS (NL3SLS) 
and is the most efficient estimator based on the moment condition $E[Z_i'u_i] = 0$ in the 
special case that $\Omega_i = \Omega$. The choice $W_N = \hat S^{-1}$ gives the most efficient estimator 
under the more general assumption that $\Omega_i$ may vary with $i$. As usual, however, moment 
conditions other than $E[Z_i'u_i] = 0$ may lead to more efficient estimators. 


6.10.5. Nonlinear Simultaneous Equations Systems 


The nonlinear simultaneous equations model specifies that the gth of G equations 
for the ith of N individuals is given by 


$$u_{ig} = r_g(y_i, x_{ig}, \beta_g), \qquad g = 1, \ldots, G. \tag{6.117}$$


This is the nonlinear SUR model with regressors that now include dependent variables 
from other equations. Unlike the linear SEM, there are few practically useful results to 
help ensure that a nonlinear SEM is identified. 

Given identification, consistent estimates can be obtained using the GMM estimators 
presented in the previous section. Alternatively, we can assume that $u_i \sim N[0, \Omega]$ 
and obtain the nonlinear full-information maximum likelihood estimator. In a de- 
parture from the linear SEM, the nonlinear full-information MLE in general has an 
asymptotic distribution that differs from NL3SLS, and consistency of the nonlinear 
full-information MLE requires that the errors are actually normally distributed. For 
details see Amemiya (1985). 

Handling endogeneity in nonlinear models can be complicated. Section 16.8 con- 
siders simultaneity in Tobit models, where analysis is simpler when the model is linear 
in the latent variables. Section 20.6.2 considers a more highly nonlinear example, en- 
dogenous regressors in count data models. 


6.11. Practical Considerations 


Ideally GMM could be implemented using an econometrics package, requiring little 
more difficulty and knowledge than that needed, say, for nonlinear least-squares esti- 
mation with heteroskedastic errors. However, not all leading econometrics packages 
provide a broad GMM module. Depending on the specific application, GMM estima- 
tion may require a switch to a more suitable package or use of a matrix programming 
language along with familiarity with the algebra of GMM. 

A common application of GMM is IV estimation. Most econometrics packages in- 
clude linear IV but not all include nonlinear IV estimators. The default standard errors 
may assume homoskedastic errors rather than being heteroskedastic-robust. As already 
emphasized in Chapter 4, it can be difficult to obtain instruments that are uncorrelated 
with the error yet reasonably correlated with the regressor or, in the nonlinear case, the 
appropriate derivative of the error with respect to parameters. 

Econometrics packages usually include linear systems but not nonlinear systems. 
Again, default standard errors may not be robust to heteroskedasticity. 


6.12. Bibliographic Notes 


Textbook treatments of GMM include chapters by Davidson and MacKinnon (1993, 2004), 
Hamilton (1994), and Greene (2003). The more recent books by Hayashi (2000) and 
Wooldridge (2002) place considerable emphasis on GMM estimation. Bera and Bilias (2002) 
provide a synthesis and history of many of the estimators presented in Chapters 5 and 6. 


6.3 The original reference for GMM is Hansen (1982). A good explanation of optimal mo- 
ments for GMM is given in the appendix of Arellano (2003). The October 2002 issue of 
Journal of Business and Economic Statistics is devoted to GMM estimation. 

6.4 The classic treatment of linear IV estimation by Sargan (1958) is a key precursor to GMM. 

6.5 The nonlinear 2SLS estimator introduced by Amemiya (1974) generalizes easily to the 
GMM estimator. 

6.6 Standard references for sequential two-step estimation are Newey (1984), Murphy and 
Topel (1985), and Pagan (1986). 

6.7 A standard reference for minimum distance estimation is Chamberlain (1982). 

6.8 A good overview of empirical likelihood is provided by Mittelhammer, Judge, and Miller 
(2000) and key references are Owen (1988, 2001) and Qin and Lawless (1994). Imbens 
(2002) provides a review and application of this relatively new method. 

6.9 Texts such as Greene’s (2003) provide a more detailed coverage of systems estimation 
than that provided here, especially for linear seemingly unrelated regressions and linear 
simultaneous equations models. 

6.10 Amemiya (1985) presents nonlinear simultaneous equations in detail. 


Exercises 


6-1 For the gamma regression model of Exercise 5.2, $E[y|x] = \exp(x'\beta)$ and 
$V[y|x] = (\exp(x'\beta))^2/2$. 
(a) Show that these conditions imply that $E[x\{(y - \exp(x'\beta))^2 - (\exp(x'\beta))^2/2\}] = 0$. 
(b) Use the moment condition in part (a) to form a method of moments estimator 
$\hat\beta_{MM}$. 
(c) Give the asymptotic distribution of $\hat\beta_{MM}$ using result (6.13). 
(d) Suppose we use the moment condition $E[x(y - \exp(x'\beta))] = 0$ in addition to that 
in part (a). Give the objective function for a GMM estimator of $\beta$. 

6-2 Consider the linear regression model for data independent over $i$ with 
$y_i = x_i'\beta + u_i$. Suppose $E[u_i|x_i] \neq 0$ but there are available instruments $z_i$ with 
$E[u_i|z_i] = 0$ and $V[u_i|z_i] = \sigma_i^2$, where dim(z) > dim(x). We consider the GMM 
estimator $\hat\beta$ that minimizes 

$$Q_N(\beta) = \left[N^{-1}\sum_i z_i(y_i - x_i'\beta)\right]' W_N \left[N^{-1}\sum_i z_i(y_i - x_i'\beta)\right].$$

(a) Derive the limit distribution of $\sqrt{N}(\hat\beta - \beta_0)$ using the general GMM result 
(6.11). 
(b) State how to obtain a consistent estimate of the asymptotic variance of $\hat\beta$. 
(c) If errors are homoskedastic what choice of $W_N$ would you use? Explain your 
answer. 
(d) If errors are heteroskedastic what choice of $W_N$ would you use? Explain your 
answer. 

6-3 Consider the Laplace intercept-only example at the end of Section 6.3.6, so 
$y = \mu + u$. Then GMM estimation is based on $E[h(\mu)] = 0$, where 
$h(\mu) = [(y - \mu), (y - \mu)^3]'$. 
(a) Using knowledge of the central moments of $y$ given in Section 6.3.6, show 
that $G_0 = E[\partial h/\partial\mu] = [-1, -6]'$ and that $S_0 = E[hh']$ has diagonal entries 2 
and 720 and off-diagonal entries 24. 
(b) Hence show that $G_0'S_0^{-1}G_0 = 252/432$. 
(c) Hence show that $\hat\mu_{OGMM}$ has asymptotic variance 1.7143/N. 
(d) Show that the GMM estimator of $\mu$ with $W = I_2$ has asymptotic variance 
19.14/N. 

6-4 This question uses the probit model but requires little knowledge of the model. 
Let $y$ denote a binary variable that takes value 0 or 1 according to whether or 
not an event occurs, let $x$ denote a regressor vector, and assume independent 
observations. 
(a) Suppose $E[y|x] = \Phi(x'\beta)$, where $\Phi(\cdot)$ is the standard normal cdf. Show that 
$E[(y - \Phi(x'\beta))x] = 0$. Hence give the estimating equations for a method of 
moments estimator for $\beta$. 
(b) Will this estimator yield the same estimates as the probit MLE? [For just this 
part you need to read Section 14.3.] 
(c) Give a GMM objective function corresponding to the estimator in part (a). 
That is, give an objective function that yields the same first-order conditions, 
up to a full-rank matrix transformation, as those obtained in part (a). 
(d) Now suppose that because of endogeneity in some of the components 
$E[y|x] \neq \Phi(x'\beta)$. Assume there exists a vector $z$, dim[z] > dim[x], such that 
$E[y - \Phi(x'\beta)|z] = 0$. Give the objective function for a consistent estimator of 
$\beta$. The estimator need not be fully efficient. 
(e) For your estimator in part (d) give the asymptotic distribution of the estimator. 
State clearly any assumptions made on the dgp to obtain this result. 
(f) Give the weighting matrix, and a way to calculate it, for the optimal GMM 
estimator in part (d). 
(g) Give a real-world example of part (d). That is, give a meaningful example of 
a probit model with endogenous regressor(s) and valid instrument(s). State 
the dependent variable, the endogenous regressor(s), and the instrument(s) 
used to permit consistent estimation. [This part is surprisingly difficult.] 

6-5 Suppose we impose the constraint that $E[w_i] = g(\theta)$, where dim[w] > dim[$\theta$]. 
(a) Obtain the objective function for the GMM estimator. 
(b) Obtain the objective function for the minimum distance estimator (see 
Section 6.7) with $\pi = E[w_i]$ and $\hat\pi = \bar w$. 
(c) Show that MD and GMM are equivalent in this example. 

6-6 The MD estimator (see Section 6.7) uses the restriction $\pi - g(\theta) = 0$. Suppose 
more generally that the restriction is $h(\theta, \pi) = 0$ and we estimate using the 
generalized MD estimator that minimizes $Q_N(\theta) = h(\theta, \hat\pi)'W_Nh(\theta, \hat\pi)$. Adapt (6.68)–(6.70) 
to show that (6.67) holds with $G_0 = \partial h(\theta, \pi)/\partial\theta'|_{\theta_0, \pi_0}$ and $V[\hat\pi]$ replaced by 
$H_0V[\hat\pi]H_0'$, where $H_0 = \partial h(\theta, \pi)/\partial\pi'|_{\theta_0, \pi_0}$. 

6-7 For data generated from the dgp given in Section 6.6.4 with N = 1,000, obtain 
NL2SLS estimates and compare these to the two-stage estimates. 
Hypothesis Tests 


7.1. Introduction 


In this chapter we consider tests of hypotheses, possibly nonlinear in the parameters, 
using estimators appropriate for nonlinear models. 

The distribution of test statistics can be obtained using the same statistical theory as 
that used for estimators, since test statistics like estimators are statistics, that is, func- 
tions of the sample. Given appropriate linearization of estimators and hypotheses, the 
results closely resemble those for testing linear restrictions in the linear regression 
model. The results rely on asymptotic theory, however, and exact t- and F-distributed 
test statistics for the linear model under normality are replaced by test statistics that 
are asymptotically standard normal distributed (z-tests) or chi-square distributed. 

There are two main practical concerns in hypothesis testing. First, tests may have 
the wrong size, so that in testing at a nominal significance level of, say, 5%, the ac- 
tual probability of rejection of the null hypothesis may be much more or less than 
5%. Such a wrong size is almost certain to arise in moderate size samples as the un- 
derlying asymptotic distribution theory is only an approximation. One remedy is the 
bootstrap method, introduced in this chapter but sufficiently important and broad to be 
treated separately in Chapter 11. Second, tests may have low power, so that there is low 
probability of rejecting the null hypothesis when it should be rejected. This potential 
weakness of tests is often neglected. Size and power are given more prominence here 
than in most textbook treatments of testing. 

The Wald test, the most widely used testing procedure, is defined in Section 7.2. 
Section 7.3 additionally presents the likelihood ratio test and score or Lagrange mul- 
tiplier tests, applicable when estimation is by ML. The various tests are illustrated in 
Section 7.4. Section 7.5 extends these tests to estimators other than ML, including ro- 
bust forms of tests. Sections 7.6, 7.7, and 7.8 present, respectively, test power, Monte 
Carlo simulation methods, and the bootstrap. 

Methods for determining model specification and selection, rather than hypothesis 
tests per se, are given separate treatment in Chapter 8. 
7.2. Wald Test 


The Wald test, due to Wald (1943), is the preeminent hypothesis test in microecono- 
metrics. It requires estimation of the unrestricted model, that is, the model without 
imposition of the restrictions of the null hypothesis. The Wald test is widely used be- 
cause modern software usually permits estimation of the unrestricted model even if 
it is more complicated than the restricted model, and modern software increasingly 
provides robust variance matrix estimates that permit Wald tests under relatively weak 
distributional assumptions. The usual statistics for tests of statistical significance of 
regressors reported by computer packages are examples of Wald test statistics. 

This section presents the Wald test of nonlinear hypotheses in considerable detail, 
presenting both theory and examples. The closely related delta method, used to form 
confidence intervals or regions for nonlinear functions of parameters, is also presented. 
A weakness of the Wald test — its lack of invariance to algebraically equivalent param- 
eterizations of the null hypothesis — is detailed at the end of the section. 


7.2.1. Linear Hypotheses in Linear Models 


We first review standard linear model results, as the Wald test is a generalization of the 
usual test for linear restrictions in the linear regression model. 

The null and alternative hypotheses for a two-sided test of linear restrictions on the 
regression parameters in the linear regression model $y = X\beta + u$ are 


$$\begin{aligned} H_0 &: R\beta_0 - r = 0, \\ H_a &: R\beta_0 - r \neq 0, \end{aligned} \tag{7.1}$$


where in the notation used here there are $h$ restrictions, $R$ is an $h \times K$ matrix of 
constants of full rank $h$, $\beta$ is the $K \times 1$ parameter vector, $r$ is an $h \times 1$ vector of constants, 
and $h \le K$. 

For example, a joint test that $\beta_1 = 1$ and $\beta_2 - \beta_3 = 2$ when $K = 4$ can be expressed 
as (7.1) with 

$$R = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & -1 & 0 \end{bmatrix}, \qquad r = \begin{bmatrix} 1 \\ 2 \end{bmatrix}.$$


The Wald test of $R\beta_0 - r = 0$ is a test of closeness to zero of the sample analogue 
$R\hat\beta - r$, where $\hat\beta$ is the unrestricted OLS estimator. Under the strong assumption that 
$u \sim N[0, \sigma_0^2 I]$, the estimator $\hat\beta \sim N[\beta_0, \sigma_0^2(X'X)^{-1}]$ and so 

$$R\hat\beta - r \sim N[0, \sigma_0^2 R(X'X)^{-1}R'],$$

under $H_0$, where $R\beta_0 - r = 0$ has led to simplification to a mean of 0. Taking the 
quadratic form leads to the test statistic 


$$W_1 = (R\hat\beta - r)'[\sigma_0^2 R(X'X)^{-1}R']^{-1}(R\hat\beta - r),$$


which is exactly $\chi^2(h)$ distributed under $H_0$. In practice the test statistic $W_1$ cannot be 
calculated, however, as $\sigma_0^2$ is not known. 

In large samples replacing $\sigma_0^2$ by its estimate $s^2$ does not affect the limit distribution 
of $W_1$, since this is equivalent to premultiplication of $W_1$ by $\sigma_0^2/s^2$ and $\text{plim}(\sigma_0^2/s^2) = 1$ 
(see the Transformation Theorem A.12). Thus 


$$W_2 = (R\hat\beta - r)'[s^2 R(X'X)^{-1}R']^{-1}(R\hat\beta - r) \tag{7.2}$$


converges to the $\chi^2(h)$ distribution under $H_0$. 

The test statistic W2 is chi-square distributed only asymptotically. In this linear 
example with normal errors an alternative exact small-sample result can be obtained. 
A standard result derived in many introductory texts is that 


$$W_3 = W_2/h$$


is exactly $F(h, N - K)$ distributed under $H_0$, if $s^2 = (N - K)^{-1}\sum_i \hat u_i^2$, where $\hat u_i$ is 
the OLS residual. This is the familiar F-test statistic, which is often reexpressed in 
terms of sums of squared residuals. 

Exact results such as that for $W_3$ are not possible in nonlinear models, and even in 
linear models they require very strong assumptions. Instead, the nonlinear analogue of 
$W_2$ is employed, with distributional results that are asymptotic only. 
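
As a numerical illustration of $W_2$ and $W_3$, the sketch below evaluates both statistics for the joint test that $\beta_1 = 1$ and $\beta_2 - \beta_3 = 2$ with $K = 4$. It takes the coefficient estimates and the estimated variance matrix $s^2(X'X)^{-1}$ as given inputs; the numbers are made up for the example, and the helper names (`matmul`, `transpose`, `wald_two_restrictions`) are hypothetical. The p-value line uses the fact that the upper tail probability of a chi-square variable with two degrees of freedom is $\exp(-w/2)$, so it applies only when there are two restrictions.

```python
import math

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def wald_two_restrictions(R, r, beta_hat, V_hat):
    """Return (W2, W3) for h = 2 linear restrictions R beta = r.

    V_hat plays the role of s^2 (X'X)^{-1}.
    """
    # Estimated deviation from the null: R beta_hat - r.
    d = [sum(Rij * bj for Rij, bj in zip(row, beta_hat)) - ri
         for row, ri in zip(R, r)]
    M = matmul(matmul(R, V_hat), transpose(R))   # R V_hat R', a 2 x 2 matrix
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    # Quadratic form d' M^{-1} d via the closed-form 2 x 2 inverse.
    W2 = (M[1][1] * d[0] ** 2 - (M[0][1] + M[1][0]) * d[0] * d[1]
          + M[0][0] * d[1] ** 2) / det
    return W2, W2 / 2.0

R = [[1, 0, 0, 0], [0, 1, -1, 0]]
r = [1, 2]
beta_hat = [1.2, 2.5, 0.1, -0.4]        # hypothetical OLS estimates
V_hat = [[0.04, 0, 0, 0],               # hypothetical s^2 (X'X)^{-1}
         [0, 0.09, 0, 0],
         [0, 0, 0.16, 0],
         [0, 0, 0, 0.01]]

W2, W3 = wald_two_restrictions(R, r, beta_hat, V_hat)
p_chi2 = math.exp(-W2 / 2.0)            # Pr[chi2(2) > W2], valid for 2 df only
print(W2, W3, p_chi2)
```

With these inputs the deviations are 0.2 and 0.4 with variances 0.04 and 0.25, so $W_2$ is about 1.64, well below the 5% critical value of roughly 5.99 for two degrees of freedom, and the null hypothesis is not rejected.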


7.2.2. Nonlinear Hypotheses 


We consider hypothesis tests of $h$ restrictions, possibly nonlinear in parameters, on 
the $q \times 1$ parameter vector $\theta$, where $h \le q$. For linear regression $\theta = \beta$ and $q = K$. 
The null and alternative hypotheses for a two-sided test are 


$$\begin{aligned} H_0 &: h(\theta_0) = 0, \\ H_a &: h(\theta_0) \neq 0, \end{aligned} \tag{7.3}$$


where $h(\cdot)$ is an $h \times 1$ vector function of $\theta$. Note that $h(\theta)$ in this chapter is used to 
denote the restrictions of the null hypothesis. This should not be confused with the use 
of $h(w, \theta)$ in the previous chapter to denote the moment conditions used to form an 
MM or GMM estimator. 

Familiar linear examples include tests of statistical significance of a single 
coefficient, $h(\theta) = \theta_j = 0$, and tests of subsets of coefficients, $h(\theta) = \theta_2 = 0$. A nonlinear 
example of a single restriction is $h(\theta) = \theta_1/\theta_2 - 1 = 0$. These examples are studied 
in later sections. 

It is assumed that $h(\theta)$ is such that the $h \times q$ matrix 


$$R(\theta) = \frac{\partial h(\theta)}{\partial\theta'} \tag{7.4}$$


is of full rank $h$ when evaluated at $\theta = \theta_0$. This assumption is equivalent to linear 
independence of restrictions in the linear model, in which case $R(\theta) = R$ does not depend 
on $\theta$ and has rank $h$. It is also assumed that the parameters are not at the boundary 
of the parameter space under the null hypothesis. This rules out, for example, testing 
$H_0: \theta_1 = 0$ if the model requires $\theta_1 \ge 0$. 

7.2.3. Wald Test Statistic 


The intuition behind the Wald test is very simple. The obvious test of whether 
$h(\theta_0) = 0$ is to obtain estimate $\hat\theta$ without imposing the restrictions and see whether $h(\hat\theta) \simeq 0$. 
If $h(\hat\theta) \stackrel{a}{\sim} N[0, V[h(\hat\theta)]]$ under $H_0$ then the test statistic 


$$W = h(\hat\theta)'\left[V[h(\hat\theta)]\right]^{-1}h(\hat\theta) \stackrel{a}{\sim} \chi^2(h).$$


The only complication is finding $V[h(\hat\theta)]$, which will depend on the restrictions $h(\cdot)$ 
and the estimator $\hat\theta$. 

By a first-order Taylor series expansion (see Section 7.2.4) under the null 
hypothesis, $h(\hat\theta)$ has the same limit distribution as $R(\theta_0)(\hat\theta - \theta_0)$, where $R(\theta)$ is defined in 
(7.4). Then $h(\hat\theta)$ is asymptotically normal under $H_0$ with mean zero and variance 
matrix $R(\theta_0)V[\hat\theta]R(\theta_0)'$. A consistent estimate is $\hat R N^{-1}\hat C\hat R'$, where $\hat R = R(\hat\theta)$ and it is 
assumed that the estimator $\hat\theta$ is root-N consistent with 


$$\sqrt{N}(\hat\theta - \theta_0) \xrightarrow{d} N[0, C_0], \tag{7.5}$$


and $\hat C$ is any consistent estimate of $C_0$. 


Common Versions of the Wald Test 
The preceding discussion leads to the Wald test statistic 

$$W = N\hat h'[\hat R\hat C\hat R']^{-1}\hat h, \tag{7.6}$$

where $\hat h = h(\hat\theta)$ and $\hat R = \partial h(\theta)/\partial\theta'|_{\hat\theta}$. An equivalent expression is 
$W = \hat h'[\hat R\hat V[\hat\theta]\hat R']^{-1}\hat h$, where $\hat V[\hat\theta] = N^{-1}\hat C$ is the estimated asymptotic variance of $\hat\theta$. 

The test statistic $W$ is asymptotically $\chi^2(h)$ distributed under $H_0$. So $H_0$ is rejected 
against $H_a$ at significance level $\alpha$ if $W > \chi^2_\alpha(h)$ and is not rejected otherwise. 
Equivalently, $H_0$ is rejected at level $\alpha$ if the p-value, which equals $\Pr[\chi^2(h) > W]$, is less 
than $\alpha$. 

One can also implement the Wald test statistic as an F-test. The Wald asymptotic 
F-statistic 


$$F = W/h \tag{7.7}$$


is asymptotically $F(h, N - q)$ distributed. This yields the same p-value as $W$ in (7.6) 
as $N \to \infty$ though in finite samples the p-values will differ. For nonlinear models it 
is most common to report $W$, though $F$ is also used in the hope that it might provide a 
better approximation in small samples. 

For a test of just one restriction, the square root of the Wald chi-square test is a 
standard normal test statistic. This result is useful as it permits testing a one-sided 
hypothesis. Specifically, for scalar $h(\theta)$ the Wald z-test statistic is 


$$W_z = \frac{\hat h}{\sqrt{\hat r N^{-1}\hat C\hat r'}}, \tag{7.8}$$

where $\hat h = h(\hat\theta)$ and $\hat r = \partial h(\theta)/\partial\theta'|_{\hat\theta}$ is a $1 \times q$ vector. Result (7.6) implies that 
$W_z$ is asymptotically standard normal distributed under $H_0$. Equivalently, $W_z$ is 
asymptotically $t$ distributed with $(N - q)$ degrees of freedom, since the $t$ goes to the 
normal as $N \to \infty$. So $W_z$ can also be a Wald t-test statistic. 


Discussion 


The Wald test statistic (7.6) for the nonlinear case has the same form as the linear 
model statistic $W_2$ given in (7.2). The estimated deviation from the null hypothesis is 
$h(\hat\theta)$ rather than $(R\hat\beta - r)$. The matrix $R$ is replaced by the estimated derivative matrix 
$\hat R$, and the assumption that $R$ is of full rank is replaced by the assumption that $R_0$ is of 
full rank. Finally, the estimated asymptotic variance of the estimator is $N^{-1}\hat C$ rather 
than $s^2(X'X)^{-1}$. 

There is a range of possible consistent estimates of $C_0$ (see Section 5.5.2), leading 
in practice to different computed values of $W$ or $F$ or $W_z$ that are asymptotically 
equivalent. In particular, $C_0$ is often of the sandwich form $A_0^{-1}B_0A_0^{-1\prime}$, consistently 
estimated by a robust estimate $\hat A^{-1}\hat B\hat A^{-1\prime}$. An advantage of the Wald test is that it is easy 
to robustify to ensure valid statistical inference under relatively weak distributional 
assumptions, such as potentially heteroskedastic errors. 

Rejection of $H_0$ is more likely the larger is $W$ or $F$ or, for two-sided tests, $|W_z|$. 
This happens the further $h(\hat\theta)$ is from the null hypothesis value 0; the more efficient 
the estimator $\hat\theta$, since then $\hat C$ is small; and the larger the sample size since then $N^{-1}$ 
is small. The last result is a consequence of testing at unchanged significance level 
$\alpha$ as sample size increases. In principle one could decrease $\alpha$ as the sample size is 
increased. Such penalties for fully parametric models are presented in Section 8.5.1. 


7.2.4. Derivation of the Wald Statistic 


By an exact first-order Taylor series expansion around $\theta_0$ 


$$h(\hat\theta) = h(\theta_0) + \left.\frac{\partial h}{\partial\theta'}\right|_{\theta^+}(\hat\theta - \theta_0),$$

for some $\theta^+$ between $\hat\theta$ and $\theta_0$. It follows that 

$$\sqrt{N}\left(h(\hat\theta) - h(\theta_0)\right) = R(\theta^+)\sqrt{N}(\hat\theta - \theta_0),$$

where $R(\theta)$ is defined in (7.4), which implies that 

$$\sqrt{N}\left(h(\hat\theta) - h(\theta_0)\right) \xrightarrow{d} N[0, R_0C_0R_0'] \tag{7.9}$$


by direct application of the limit normal product rule (Theorem A.7) as $R(\theta^+) \xrightarrow{p} R_0 = R(\theta_0)$ 
and using the limit distribution for $\sqrt{N}(\hat\theta - \theta_0)$ given in (7.5). 

Under the null hypothesis (7.9) simplifies since $h(\theta_0) = 0$, and hence 

$$\sqrt{N}h(\hat\theta) \xrightarrow{d} N[0, R_0C_0R_0'] \tag{7.10}$$


under $H_0$. One could in theory use this multivariate normal distribution to define a 
rejection region, but it is much simpler to transform to a chi-square distribution. 
Recall that $z \sim N[0, \Omega]$ with $\Omega$ of full rank implies $z'\Omega^{-1}z \sim \chi^2(\dim(\Omega))$. Then (7.10) 
implies that 

$$Nh(\hat\theta)'[R_0C_0R_0']^{-1}h(\hat\theta) \xrightarrow{d} \chi^2(h),$$


under $H_0$, where the matrix inverse in this expression exists by the assumptions that $R_0$ 
and $C_0$ are of full rank. The Wald statistic defined in (7.6) is obtained upon replacing 
$R_0$ and $C_0$ by consistent estimates. 


7.2.5. Wald Test Examples 


The most common tests are tests of one or more exclusion restrictions. We also provide 
an example of a test of a nonlinear hypothesis. 


Tests of Exclusion Restrictions 


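Consider the exclusion restrictions that the last $h$ components of $\theta$ are equal to zero. 
Then $h(\theta) = \theta_2 = 0$ where we partition $\theta = (\theta_1', \theta_2')'$. It follows that 


$$R(\theta) = \frac{\partial h(\theta)}{\partial\theta'} = \left[\frac{\partial\theta_2}{\partial\theta_1'} \;\; \frac{\partial\theta_2}{\partial\theta_2'}\right] = [0 \;\; I_h],$$


where 0 is an $h \times (q - h)$ matrix of zeros and $I_h$ is an $h \times h$ identity matrix, so 


$$R(\theta)C(\theta)R(\theta)' = [0 \;\; I_h]\begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix}\begin{bmatrix} 0 \\ I_h \end{bmatrix} = C_{22}.$$


The Wald test statistic for exclusion restrictions is therefore 

$$W = \hat\theta_2'[N^{-1}\hat C_{22}]^{-1}\hat\theta_2, \tag{7.11}$$


where $N^{-1}\hat C_{22} = \hat V[\hat\theta_2]$, and is asymptotically distributed as $\chi^2(h)$ under $H_0$. 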

This test statistic is a generalization of the test of subsets of regressors in the linear 
regression model. In that case small-sample results are available if errors are normally 
distributed and the related F-test is instead used. 


Tests of Statistical Significance 


Tests of significance of a single coefficient are tests of whether or not $\theta_j$, the jth 
component of $\theta$, differs from zero. Then $h(\theta) = \theta_j$ and $r(\theta) = \partial h/\partial\theta'$ is a vector of 
zeros except for a jth entry of 1, so (7.8) simplifies to 

$$W_z = \frac{\hat\theta_j}{\text{se}[\hat\theta_j]}, \tag{7.12}$$


where $\text{se}[\hat\theta_j] = \sqrt{N^{-1}\hat c_{jj}}$ is the standard error of $\hat\theta_j$ and $\hat c_{jj}$ is the jth diagonal entry 
in $\hat C$. 

The test statistic $W_z$ in (7.12) is often called a “t-statistic”, owing to results for 
the linear regression model under normality, but strictly speaking it is an asymptotic 
“z-statistic.” 

For a two-sided test of $H_0: \theta_{j0} = 0$ against $H_a: \theta_{j0} \neq 0$, $H_0$ is rejected at 
significance level $\alpha$ if $|W_z| > z_{\alpha/2}$ and is not rejected otherwise. This yields exactly the same 
results as the Wald chi-square test, since $W_z^2 = W$, where $W$ is defined in (7.6), and 
$z_{\alpha/2}^2 = \chi^2_\alpha(1)$. 

Often there is prior information about the sign of $\theta_j$. Then one should use a 
one-sided hypothesis test. For example, suppose it is felt based on economic reasoning or 
past studies that $\theta_j > 0$. It makes a difference whether $\theta_j > 0$ is specified to be the null 
or the alternative hypothesis. For one-sided tests it is customary to specify the claim 
made as the alternative hypothesis, as it can be shown that then stronger evidence is 
required to support the claim. Here $H_0: \theta_{j0} \le 0$ is rejected against $H_a: \theta_{j0} > 0$ at 
significance level $\alpha$ if $W_z > z_\alpha$. Similarly, for a claim that $\theta_j < 0$, test $H_0: \theta_{j0} \ge 0$ 
against $H_a: \theta_{j0} < 0$ and reject $H_0$ at significance level $\alpha$ if $W_z < -z_\alpha$. 

Computer output usually gives the p-value for a two-sided test, but in many cases 
it is more appropriate to use a one-sided test. If $\hat\theta_j$ has the “correct” sign then the 
p-value for the one-sided test is half that reported for a two-sided test. 
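
The relation between the two-sided and one-sided p-values can be checked with a few lines of code. The sketch below is illustrative only: the estimate and standard error are made-up numbers, the function names (`normal_sf`, `wald_z`) are hypothetical, and the standard normal tail probability is computed from the complementary error function in the Python standard library.

```python
import math

def normal_sf(z):
    """Upper tail probability Pr[Z > z] for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def wald_z(theta_hat, se):
    """Wald z-statistic with two-sided and one-sided asymptotic p-values."""
    w = theta_hat / se
    p_two = 2.0 * normal_sf(abs(w))   # H_a: theta_j differs from 0
    p_upper = normal_sf(w)            # H_a: theta_j > 0
    p_lower = normal_sf(-w)           # H_a: theta_j < 0
    return w, p_two, p_upper, p_lower

theta_hat, se = 0.9, 0.5              # hypothetical estimate and standard error
w, p_two, p_upper, p_lower = wald_z(theta_hat, se)

# The chi-square(1) p-value of W = w^2 equals the two-sided normal p-value.
p_chi2_1 = math.erfc(math.sqrt(w * w / 2.0))
print(w, p_two, p_upper, p_lower, p_chi2_1)
```

Here the estimate is positive, so the one-sided p-value against the alternative of a positive coefficient is half the two-sided value, whereas the p-value against the alternative of a negative coefficient is one minus that half.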


Tests of Nonlinear Restriction 
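Consider a test of the single nonlinear restriction 

$$H_0: h(\theta) = \theta_1/\theta_2 - 1 = 0.$$


Then $R(\theta)$ is a $1 \times q$ vector with first element $\partial h/\partial\theta_1 = 1/\theta_2$, second element 
$\partial h/\partial\theta_2 = -\theta_1/\theta_2^2$, and remaining elements zero. By letting $\hat c_{jk}$ denote the jkth 
element of $\hat C$, (7.6) becomes 


$$W = N\left(\frac{\hat\theta_1}{\hat\theta_2} - 1\right)^2\left(\left[\frac{1}{\hat\theta_2} \;\; -\frac{\hat\theta_1}{\hat\theta_2^2} \;\; 0'\right]\begin{bmatrix} \hat c_{11} & \hat c_{12} & \cdots \\ \hat c_{21} & \hat c_{22} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}\begin{bmatrix} 1/\hat\theta_2 \\ -\hat\theta_1/\hat\theta_2^2 \\ 0 \end{bmatrix}\right)^{-1},$$

where 0 is a $(q - 2) \times 1$ vector of zeros, yielding 

$$W = N[\hat\theta_2(\hat\theta_1 - \hat\theta_2)]^2(\hat\theta_2^2\hat c_{11} - 2\hat\theta_1\hat\theta_2\hat c_{12} + \hat\theta_1^2\hat c_{22})^{-1}, \tag{7.13}$$


which is asymptotically $\chi^2(1)$ distributed under $H_0$. Equivalently, $\sqrt{W}$ is 
asymptotically standard normal distributed. 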
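
To confirm that the closed form (7.13) agrees with the general statistic (7.6), the following sketch computes the Wald statistic for the ratio restriction both ways. The parameter estimates, the variance matrix, and the sample size are hypothetical values chosen for illustration, as are the function names.

```python
import math

def wald_ratio_general(theta, C, N):
    """W = N h' [R C R']^{-1} h for the scalar restriction theta_1/theta_2 - 1 = 0."""
    t1, t2 = theta[0], theta[1]
    q = len(theta)
    h = t1 / t2 - 1.0
    # Derivative of h with respect to theta: nonzero in the first two entries only.
    R = [1.0 / t2, -t1 / t2 ** 2] + [0.0] * (q - 2)
    rcr = sum(R[j] * C[j][k] * R[k] for j in range(q) for k in range(q))
    return N * h * h / rcr

def wald_ratio_closed(theta, C, N):
    """Closed-form version of the same statistic, as in (7.13)."""
    t1, t2 = theta[0], theta[1]
    num = (t2 * (t1 - t2)) ** 2
    den = t2 ** 2 * C[0][0] - 2.0 * t1 * t2 * C[0][1] + t1 ** 2 * C[1][1]
    return N * num / den

theta_hat = [1.3, 1.0, 0.5]            # hypothetical estimates, q = 3
C_hat = [[2.0, 0.5, 0.1],              # hypothetical estimate of C_0
         [0.5, 1.0, 0.2],
         [0.1, 0.2, 3.0]]
N = 200

W_general = wald_ratio_general(theta_hat, C_hat, N)
W_closed = wald_ratio_closed(theta_hat, C_hat, N)
print(W_general, W_closed, math.sqrt(W_general))
```

The two functions return the same value because only the upper-left 2 x 2 block of the variance matrix enters the quadratic form when the derivative vector has zeros in its remaining entries.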


7.2.6. Tests in Misspecified Models 


Most treatments of hypothesis testing, including that given in Chapters 7 and 8 of 
this book, assume that the null hypothesis model is correctly specified, aside from 
relatively minor misspecification that does not affect estimator consistency but requires 
robustification of standard errors. 

In practice this is a considerable oversimplification. For example, in testing for het- 
eroskedastic errors it is assumed that this is the only respect in which the regression 
is deficient. However, if the conditional mean is misspecified then the true size of 
the test will differ from the nominal size, even asymptotically. Moreover, asymptotic 
equivalence of tests, such as that for the Wald, likelihood ratio, and Lagrange mul- 
tiplier tests, will no longer hold. The better specified the model, however, the more 
useful are the tests. 

Also, note that tests often have some power against hypotheses other than the ex- 
plicitly stated alternative hypothesis. For example, suppose the null hypothesis model 
is $y = \beta_1 + \beta_2 x + u$, where $u$ is homoskedastic. A test of whether to also include $z$ as 
a regressor will also have some power against the alternative that the model is 
nonlinear in $x$, for example $y = \beta_1 + \beta_2 x + \beta_3 x^2 + u$, if $x$ and $z$ are correlated. Similarly, a 
test against heteroskedastic errors will also have some power against nonlinearity in x. 
Rejection of the null hypothesis does not mean that the alternative hypothesis model 
is the only possible model. 


7.2.7. Joint Versus Separate Tests 


In applied work one often wants to know which coefficients out of a set of coefficients 
are “significant.” When there are several hypotheses under test, one can either do a 
joint test or simultaneous test of all hypotheses of interest or perform separate tests 
of the hypotheses. 

A leading example in linear regression concerns the use of separate t-tests for
testing the null hypotheses $H_{10}: \beta_1 = 0$ and $H_{20}: \beta_2 = 0$ versus using an F-test of the
joint hypothesis $H_0: \beta_1 = \beta_2 = 0$, where throughout the alternative is that at least
one of the parameters does not equal zero. The F-test is an explicit joint test, with
rejection of $H_0$ if the estimated point $(\hat\beta_1, \hat\beta_2)$ falls outside an elliptical probability
contour. Alternatively, the two separate t-tests can be conducted. This procedure is an
implicit joint test, called an induced test (Savin, 1984). The separate tests reject $H_0$ if
either $H_{10}$ or $H_{20}$ is rejected, which occurs if $(\hat\beta_1, \hat\beta_2)$ falls outside a rectangle whose
boundaries are the critical values of the two test statistics. Even if the same
significance level is used to test $H_0$, so that the ellipse and rectangle have the same area,
the rejection regions for the joint and separate tests differ and there is a potential for a
conflict between them. For example, $(\hat\beta_1, \hat\beta_2)$ may lie within the ellipse but outside the
rectangle.

Let $e_1$ and $e_2$ denote the event of type I error (see Section 7.5.1) in the two separate
tests, and let $e_J = e_1 \cup e_2$ denote the event of a type I error in the induced joint test.
Then $\Pr[e_J] = \Pr[e_1] + \Pr[e_2] - \Pr[e_1 \cap e_2]$, which implies that


$$\alpha_J \le \alpha_1 + \alpha_2, \tag{7.14}$$


where $\alpha_J$, $\alpha_1$, and $\alpha_2$ denote the sizes of, respectively, the induced joint test, the first
separate test, and the second separate test. In the special case where the separate tests
are statistically independent, $\Pr[e_1 \cap e_2] = \Pr[e_1]\Pr[e_2] = \alpha_1\alpha_2$ and hence
$\alpha_J = \alpha_1 + \alpha_2 - \alpha_1\alpha_2$. For typically low values of $\alpha_1$ and $\alpha_2$, such as .05 or .01,
$\alpha_1\alpha_2$ is very small and the upper bound (7.14) is a good indicator of the size of the test.
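
The arithmetic behind (7.14) can be checked with a short Python sketch. The function names below are illustrative only and do not come from any package; the exact-size formula assumes the two separate tests are statistically independent, as in the special case just described.

```python
def induced_size_independent(alpha1, alpha2):
    """Size of the induced test when the two separate tests are independent."""
    return alpha1 + alpha2 - alpha1 * alpha2


def upper_bound(alpha1, alpha2):
    """Upper bound (7.14) on the size of the induced test."""
    return alpha1 + alpha2


for a in (0.05, 0.01):
    exact = induced_size_independent(a, a)
    bound = upper_bound(a, a)
    print(f"alpha1 = alpha2 = {a}: independent-case size = {exact:.6f}, bound = {bound:.6f}")

# Running each separate test at alpha_J / 2 keeps the induced test's size
# at or below alpha_J, by the bound (7.14).
alpha_J = 0.05
per_test = alpha_J / 2
print(per_test, induced_size_independent(per_test, per_test))
```

With both separate tests at level .05 the independent-case size is .0975, close to the bound of .10, which is why the bound is a useful guide when the separate sizes are small.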

A substantial literature on induced tests examines the problem of choosing critical 
values for the separate tests such that the induced test has a known size. We do not
pursue this issue at length but mention the Bonferroni t-test as an example. The critical
values of this test have been tabulated; see Savin (1984).

Statistically independent tests arise in linear regression with orthogonal regressors
and in likelihood-based testing (see Section 7.3) if relevant parts of the information 
matrix are diagonal. Then the induced joint test statistic is based on the two statistically 
independent separate test statistics, whereas the explicit joint null test statistic is the 
sum of the two separate test statistics. The joint null may be rejected because either 
one component or both components of the null are rejected. The use of separate tests 
will reveal which situation applies. 

In the more general case of correlated regressors or a nondiagonal information ma- 
trix, the explicit joint test suffers from the disadvantage that the rejection of the null 
does not indicate the source of the rejection. If the induced joint test is used then set- 
ting the size of the test requires some variant of the Bonferroni test or approximation 
using the upper bound in (7.14). Similar issues also arise when separate tests are ap- 
plied sequentially, with each stage conditioned on the outcome of the previous stage. 
Section 18.7.1 presents an example with discussion of a joint test of two hypotheses 
where the two components of the test are correlated. 


7.2.8. Delta Method for Confidence Intervals 


The method used to derive the Wald test statistic is called the delta method, as Taylor
series approximation of $h(\hat\theta)$ entails taking the derivative of $h(\theta)$. This method can
also be used to obtain the distribution of a nonlinear combination of parameters and
hence form confidence intervals or regions.

One example is estimating the ratio $\theta_1/\theta_2$ by $\hat\theta_1/\hat\theta_2$. A second example is prediction
of the conditional mean $g(x'\beta)$, say, using $g(x'\hat\beta)$. A third example is the estimated
elasticity with respect to change in one component of $x$.


Confidence Intervals 
Consider inference on the parameter vector $\gamma = h(\theta)$ that is estimated by

$$\hat\gamma = h(\hat\theta), \tag{7.15}$$

where the limit distribution of $\sqrt{N}(\hat\theta - \theta_0)$ is that given in (7.5). Then direct
application of (7.9) yields $\sqrt{N}(\hat\gamma - \gamma_0) \xrightarrow{d} N[0, R_0 C_0 R_0']$, where $R(\theta)$ is defined in
(7.4). Equivalently, we say that $\hat\gamma$ is asymptotically normally distributed with estimated
asymptotic variance matrix


$$\hat V[\hat\gamma] = \hat R N^{-1} \hat C \hat R', \tag{7.16}$$


a result that can be used to form confidence intervals or regions.
In particular, a $100(1 - \alpha)\%$ confidence interval for the scalar parameter $\gamma$ is


$$\gamma \in \hat\gamma \pm z_{\alpha/2}\, \mathrm{se}[\hat\gamma], \tag{7.17}$$

where

$$\mathrm{se}[\hat\gamma] = \sqrt{\hat r N^{-1} \hat C \hat r'}, \tag{7.18}$$

where $\hat r = r(\hat\theta)$ and $r(\theta) = \partial\gamma/\partial\theta' = \partial h(\theta)/\partial\theta'$.


Confidence Interval Examples


As an example, suppose that $E[y|x] = \exp(x'\beta)$ and we wish to obtain a confidence
interval for the predicted conditional mean when $x = x_p$. Then $h(\beta) = \exp(x_p'\beta)$, so
$\partial h/\partial\beta' = \exp(x_p'\beta)x_p'$, and (7.18) yields


$$\mathrm{se}[\exp(x_p'\hat\beta)] = \exp(x_p'\hat\beta)\sqrt{x_p' N^{-1} \hat C x_p},$$


where $\hat C$ is a consistent estimate of the variance matrix in the limit distribution of
$\sqrt{N}(\hat\beta - \beta_0)$.

As a second example, suppose we wish to obtain a confidence interval for $e^\beta$ rather
than for $\beta$, a scalar coefficient. Then $h(\beta) = e^\beta$, so $\partial h/\partial\beta = e^\beta$ and (7.18) yields
$\mathrm{se}[e^{\hat\beta}] = e^{\hat\beta}\,\mathrm{se}[\hat\beta]$. This yields a 95% confidence interval for $e^\beta$ of $e^{\hat\beta} \pm 1.96\, e^{\hat\beta}\,\mathrm{se}[\hat\beta]$.

The delta method is not always the best method to obtain a confidence interval,
because it restricts the confidence interval to being symmetric about $\hat\gamma$. Moreover, in
the preceding example the confidence interval can include negative values even though
$e^\beta > 0$. An alternative confidence interval is obtained by exponentiation of the terms
in the confidence interval for $\beta$. Then


$$\Pr[\hat\beta - 1.96\,\mathrm{se}[\hat\beta] < \beta < \hat\beta + 1.96\,\mathrm{se}[\hat\beta]] = 0.95$$
$$\Rightarrow \Pr[\exp(\hat\beta - 1.96\,\mathrm{se}[\hat\beta]) < e^\beta < \exp(\hat\beta + 1.96\,\mathrm{se}[\hat\beta])] = 0.95.$$


This confidence interval has the advantage of being asymmetric and including only
positive values. This transformation is often used for confidence intervals for slope
parameters in binary outcome models and in duration models. The approach can be
generalized to other transformations $\gamma = h(\theta)$, provided $h(\cdot)$ is monotonic.
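
The contrast between the two intervals for $e^\beta$ can be seen in a small Python sketch. The estimate and standard error used here are hypothetical values chosen so that the delta-method interval dips below zero; the function names are illustrative.

```python
import math


def delta_ci_exp(beta_hat, se_beta, z=1.96):
    """Delta-method interval: exp(b) +/- z * exp(b) * se(b), using (7.18)."""
    g = math.exp(beta_hat)
    se_g = g * se_beta  # derivative of exp(beta) is exp(beta)
    return g - z * se_g, g + z * se_g


def transformed_ci_exp(beta_hat, se_beta, z=1.96):
    """Exponentiate the endpoints of the usual interval for beta."""
    return math.exp(beta_hat - z * se_beta), math.exp(beta_hat + z * se_beta)


beta_hat, se_beta = 0.5, 0.6  # hypothetical estimate and standard error
print("delta method:", delta_ci_exp(beta_hat, se_beta))
print("transformed :", transformed_ci_exp(beta_hat, se_beta))
```

With these values the delta-method interval is symmetric about $e^{0.5}$ and has a negative lower limit, whereas the transformed interval is asymmetric and strictly positive.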


7.2.9. Lack of Invariance of the Wald Test 


The Wald test statistic is easily obtained, provided estimates of the unrestricted model 
can be obtained, and is no less powerful than other possible test procedures, as dis- 
cussed in later sections. For these reasons it is the most commonly used test procedure. 

However, the Wald test has a fundamental problem: It is not invariant to alge- 
braically equivalent parameterizations of the null hypothesis. For example, consider 
the example of Section 7.2.5. Then $H_0: \theta_1/\theta_2 - 1 = 0$ can equivalently be expressed
as $H_0: \theta_1 - \theta_2 = 0$, leading to Wald chi-square test statistic


$$W^* = N(\hat\theta_1 - \hat\theta_2)^2 (\hat c_{11} - 2\hat c_{12} + \hat c_{22})^{-1}, \tag{7.19}$$


which differs from $W$ in (7.13). The statistics $W$ and $W^*$ are asymptotically equivalent
but can differ substantially in finite samples, as demonstrated in a Monte Carlo exercise by Gregory
and Veall (1985), who considered a very similar example. For tests with nominal size
0.05, one variant of the Wald test had actual size between 0.04 and 0.06 across all sim- 
ulations, so asymptotic theory provided a good small-sample approximation, whereas 
an alternative asymptotically equivalent variant of the Wald test had actual size that in 
some simulations exceeded 0.20.

Phillips and Park (1988) explained the differences by showing that, although different
representations of the null hypothesis restrictions have the same chi-square distribution
using conventional asymptotic methods, they have different asymptotic distributions
using a more refined asymptotic theory based on Edgeworth expansions (see
Section 11.4.3). Furthermore, in particular settings such as the previous example, the 
Edgeworth expansions can be used to indicate parameterizations of Ho and regions 
of the parameter space where the usual asymptotic theory is likely to provide a poor 
small-sample approximation. 

The lesson is that care is needed when nonlinear restrictions are being tested. As 
a robustness check one can perform several Wald tests using different algebraically 
equivalent representations of the null hypothesis restrictions. If these lead to substan- 
tially different conclusions there may be a problem. One solution is to perform a boot- 
strap version of the Wald test. This can provide better small-sample performance and 
eliminate much of the difference between Wald tests that use different representations 
of $H_0$, because from Section 11.4.4 the bootstrap essentially implements an Edgeworth
expansion. A second solution is to use other testing methods, given in the next section,
that are invariant to different representations of $H_0$.
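
The lack of invariance is easy to see numerically. The following Python sketch evaluates the two Wald statistics (7.13) and (7.19) at the same estimates; the parameter estimates, variance elements, and sample size are hypothetical values chosen for illustration.

```python
def wald_ratio(theta1, theta2, c11, c12, c22, n):
    """W in (7.13), for the null written as theta1/theta2 - 1 = 0."""
    numerator = (theta2 * (theta1 - theta2)) ** 2
    denominator = theta2**2 * c11 - 2 * theta1 * theta2 * c12 + theta1**2 * c22
    return n * numerator / denominator


def wald_difference(theta1, theta2, c11, c12, c22, n):
    """W* in (7.19), for the null written as theta1 - theta2 = 0."""
    return n * (theta1 - theta2) ** 2 / (c11 - 2 * c12 + c22)


# Hypothetical estimates and elements of the estimated variance matrix C.
args = dict(theta1=1.5, theta2=1.0, c11=4.0, c12=0.0, c22=4.0, n=50)
print("W  =", wald_ratio(**args))
print("W* =", wald_difference(**args))
```

The two statistics test the same hypothesis from the same estimates yet take different values here; they coincide only as $\hat\theta_1$ approaches $\hat\theta_2$.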


7.3. Likelihood-Based Tests 


In this section we consider hypothesis testing when the likelihood function is known, 
that is, the distribution is fully specified. There are then three classical statistical tech- 
niques for testing hypotheses — the Wald test, the likelihood ratio (LR) test, and the 
Lagrange multiplier (LM) test. A fourth test, the $C(\alpha)$ test, due to Neyman (1959), is
less commonly used and is not presented here; see Davidson and MacKinnon (1993). 
All four tests are asymptotically equivalent, so one chooses among them based on ease 
of computation and on finite-sample performance. We also do not cover the smooth 
test of Neyman (1937), which Bera and Ghosh (2002) argue is optimal and is as fun- 
damental as the other tests. 

These results assume correct specification of the likelihood function. Extension to 
tests based on quasi-ML estimators, as well as on m-estimators and efficient GMM 
estimators, is given in Section 7.5. 


7.3.1. Wald, Likelihood Ratio, and Lagrange Multiplier (Score) Tests 


Let $L(\theta)$ denote the likelihood function, the joint conditional density of $y$ given $X$ and
parameters $\theta$. We wish to test the null hypothesis given in (7.3) that $h(\theta_0) = 0$.

Tests other than the Wald test require estimation that imposes the restrictions of the
null hypothesis. Define the estimators


$$\hat\theta_u \ \text{(unrestricted MLE)}, \qquad \tilde\theta_r \ \text{(restricted MLE)}. \tag{7.20}$$


The unrestricted MLE $\hat\theta_u$ maximizes $\ln L(\theta)$; it was more simply denoted $\hat\theta$ in
earlier discussion of the Wald test. The restricted MLE $\tilde\theta_r$ maximizes the Lagrangian
$\ln L(\theta) - \lambda' h(\theta)$, where $\lambda$ is an $h \times 1$ vector of Lagrangian multipliers. In the simple
case of exclusion restrictions $h(\theta) = \theta_2 = 0$, where $\theta = (\theta_1', \theta_2')'$, the restricted MLE
is $\tilde\theta_r = (\tilde\theta_{1r}', 0')'$, where $\tilde\theta_{1r}$ is obtained simply as the maximum with respect to $\theta_1$ of
the restricted likelihood $\ln L(\theta_1, 0)$ and $0$ is an $h \times 1$ vector of zeros.

We motivate and define the three test statistics here, with derivation deferred to
Section 7.3.3. All three test statistics converge in distribution to $\chi^2(h)$ under $H_0$. So
$H_0$ is rejected at significance level $\alpha$ if the computed test statistic exceeds $\chi^2_\alpha(h)$.
Equivalently, reject $H_0$ at level $\alpha$ if $p < \alpha$, where $p = \Pr[\chi^2(h) > t]$ is the p-value
and $t$ is the computed value of the test statistic.


Likelihood Ratio Test 


The motivation for the LR test statistic is that if $H_0$ is true, the unconstrained and
constrained maxima of the log-likelihood function should be the same. This suggests
using a function of the difference between $\ln L(\hat\theta_u)$ and $\ln L(\tilde\theta_r)$.

Implementation requires obtaining the limit distribution of this difference. It can be
shown that twice the difference is asymptotically chi-square distributed under $H_0$. This
leads immediately to the likelihood ratio test statistic


$$LR = -2[\ln L(\tilde\theta_r) - \ln L(\hat\theta_u)]. \tag{7.21}$$


Wald Test 


The motivation for the Wald test is that if $H_0$ is true, the unrestricted MLE $\hat\theta_u$ should
satisfy the restrictions of $H_0$, so $h(\hat\theta_u)$ should be close to zero.

Implementation requires obtaining the asymptotic distribution of $h(\hat\theta_u)$. The general
form of the Wald test is given in (7.6). Specialization occurs for the MLE because by
the IM equality $V[\hat\theta_u] = -N^{-1}A_0^{-1}$, where


$$A_0 = \operatorname{plim} N^{-1} \left.\frac{\partial^2 \ln L}{\partial\theta\,\partial\theta'}\right|_{\theta_0}. \tag{7.22}$$

This leads to the Wald test statistic

$$W = -N \hat h' [\hat R \hat A^{-1} \hat R']^{-1} \hat h, \tag{7.23}$$


where $\hat h = h(\hat\theta_u)$, $\hat R = R(\hat\theta_u)$, $R(\theta) = \partial h(\theta)/\partial\theta'$, and $\hat A$ is a consistent estimate of $A_0$.
The minus sign appears since $A_0$ is negative definite.


Lagrange Multiplier Test or Score Test 


One motivation for the LM test statistic is that the gradient $\partial \ln L/\partial\theta|_{\hat\theta_u} = 0$ at the
maximum of the likelihood function. If $H_0$ is true, then this maximum should also
occur at the restricted MLE (i.e., $\partial \ln L/\partial\theta|_{\tilde\theta_r} \simeq 0$) because imposing the constraint
will have little impact on the estimated value of $\theta$. Using this motivation LM is called
the score test because $\partial \ln L/\partial\theta$ is the score vector.

An alternative motivation is to measure the closeness to zero of the Lagrange
multipliers of the constrained optimization problem for the restricted MLE. Maximizing
$\ln L(\theta) - \lambda' h(\theta)$ with respect to $\theta$ implies that

$$\left.\frac{\partial \ln L}{\partial\theta}\right|_{\tilde\theta_r} = \left.\frac{\partial h(\theta)'}{\partial\theta}\right|_{\tilde\theta_r} \times \tilde\lambda_r. \tag{7.24}$$

It follows that tests based on the estimated Lagrange multipliers $\tilde\lambda_r$ are equivalent to
tests based on the score $\partial \ln L/\partial\theta|_{\tilde\theta_r}$, since $\partial h/\partial\theta'$ is assumed to be of full rank.

Implementation requires obtaining the asymptotic distribution of $\partial \ln L/\partial\theta|_{\tilde\theta_r}$. This
leads to the Lagrange multiplier test or score test statistic


$$LM = -N^{-1} \left.\frac{\partial \ln L}{\partial\theta'}\right|_{\tilde\theta_r} \tilde A^{-1} \left.\frac{\partial \ln L}{\partial\theta}\right|_{\tilde\theta_r}, \tag{7.25}$$


where $\tilde A$ is a consistent estimate of $A_0$ in (7.22) evaluated at $\tilde\theta_r$ rather than $\hat\theta_u$.

The LM test, due to Aitchison and Silvey (1958) and Silvey (1959), is equivalent to 
the score test, due to Rao (1947). The test statistic LM is usually derived by obtaining 
an analytical expression for the score rather than the Lagrange multipliers. Econome- 
tricians usually call the test an LM test, even though a clearer terminology is to call it 
a score test. 


Discussion 


Good intuition is provided by the expository graphical treatment of the three tests by 
Buse (1982) that views all three tests as measuring the change in the log-likelihood. 
Here we provide a verbal summary. 

Consider scalar parameter $\theta$ and a Wald test of whether $\theta_0 - \theta^* = 0$. Then a given
departure of $\hat\theta_u$ from $\theta^*$ will translate into a larger change in $\ln L$, the more curved
is the log-likelihood function. A natural measure of curvature is the second derivative
$H(\theta) = \partial^2 \ln L/\partial\theta^2$. This suggests $W = -(\hat\theta_u - \theta^*)^2 H(\hat\theta_u)$. The statistic $W$ in (7.23)
can be viewed as a generalization to vector $\theta$ and more general restrictions $h(\theta_0)$ with
$N\hat A$ measuring the curvature.

For the score test Buse shows that a given value of $\partial \ln L/\partial\theta|_{\tilde\theta_r}$ translates into a
larger change in $\ln L$, the less curved is the log-likelihood function. This leads to use
of $(N\tilde A)^{-1}$ in (7.25). And the statistic LR directly compares the log-likelihoods.


An Illustration 


To illustrate the three tests consider an iid example with $y_i \sim N[\mu_0, 1]$ and test of
$H_0: \mu_0 = \mu^*$. Then $\hat\mu_u = \bar y$ and $\tilde\mu_r = \mu^*$.

For the LR test, $\ln L(\mu) = -\frac{N}{2}\ln 2\pi - \frac{1}{2}\sum_i (y_i - \mu)^2$ and some algebra yields


$$LR = 2[\ln L(\bar y) - \ln L(\mu^*)] = N(\bar y - \mu^*)^2.$$


The Wald test is based on whether $\bar y - \mu^* \simeq 0$. Here it is easy to show that
$\bar y - \mu^* \sim N[0, 1/N]$ under $H_0$, leading to the quadratic form


$$W = (\bar y - \mu^*)[1/N]^{-1}(\bar y - \mu^*).$$

This simplifies to $N(\bar y - \mu^*)^2$ and so here W = LR.

The LM test is based on closeness to zero of $\partial \ln L(\mu)/\partial\mu|_{\mu^*} = \sum_i (y_i - \mu)|_{\mu^*} = N(\bar y - \mu^*)$.
This is just a rescaling of $(\bar y - \mu^*)$ so LM = W. More formally, $A(\mu^*) = -1$
since $\partial^2 \ln L(\mu)/\partial\mu^2 = -N$ and (7.25) yields


$$LM = N^{-1}(N(\bar y - \mu^*))[1]^{-1}(N(\bar y - \mu^*)).$$


This also simplifies to $N(\bar y - \mu^*)^2$ and verifies that LM = W = LR.

Despite their quite different motivations, the three test statistics are equivalent here.
This exact equivalence is special to this example with constant curvature owing to a
log-likelihood quadratic in $\mu$. More generally the three test statistics differ in finite
samples but are equivalent asymptotically (see Section 7.3.4).
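
The equality of the three statistics in this example can be confirmed numerically. The sketch below draws a hypothetical sample (the sample size, true mean, seed, and hypothesized value are all arbitrary choices) and computes LR from the log-likelihoods, W from the quadratic form, and LM from (7.25) with $A = -1$.

```python
import math
import random

random.seed(0)
N, mu_star = 50, 0.0                               # hypothetical sample size and null value
y = [random.gauss(0.3, 1.0) for _ in range(N)]     # hypothetical dgp with unit variance
ybar = sum(y) / N


def loglik(mu):
    """Log-likelihood for iid N[mu, 1] data."""
    return -0.5 * N * math.log(2 * math.pi) - 0.5 * sum((yi - mu) ** 2 for yi in y)


# LR: twice the difference between unrestricted and restricted log-likelihoods.
LR = 2 * (loglik(ybar) - loglik(mu_star))

# Wald: quadratic form in (ybar - mu*) with variance 1/N.
W = (ybar - mu_star) * (1 / (1 / N)) * (ybar - mu_star)

# LM: (7.25) with the score evaluated at mu* and A = -1.
score = sum(yi - mu_star for yi in y)
A = -1.0
LM = -(1 / N) * score * (1 / A) * score

print(f"LR = {LR:.6f}, W = {W:.6f}, LM = {LM:.6f}")
```

All three printed values agree up to floating-point rounding, as the algebra implies.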


7.3.2. Poisson Regression Example 


Consider testing exclusion restrictions in the Poisson regression model introduced in 
Section 5.2. This example is mainly pedagogical as in practice one should perform 
statistical inference for count data under weaker distributional assumptions than those 
of the Poisson model (see Chapter 20). 

If $y$ given $x$ is Poisson distributed with conditional mean $\exp(x'\beta)$ then the
log-likelihood function is


$$\ln L(\beta) = \sum_i \{-\exp(x_i'\beta) + y_i x_i'\beta - \ln y_i!\}. \tag{7.26}$$


For $h$ exclusion restrictions the null hypothesis is $H_0: h(\beta) = \beta_2 = 0$, where
$\beta = (\beta_1', \beta_2')'$.

The unrestricted MLE $\hat\beta$ maximizes (7.26) with respect to $\beta$ and has first-order
conditions $\sum_i (y_i - \exp(x_i'\beta))x_i = 0$. The limit variance matrix is $-A^{-1}$, where


$$A = -\operatorname{plim} N^{-1} \sum_i \exp(x_i'\beta) x_i x_i'.$$


The restricted MLE is $\tilde\beta = (\tilde\beta_1', 0')'$, where $\tilde\beta_1$ maximizes (7.26) with respect to $\beta_1$,
with $x_i'\beta$ replaced by $x_{1i}'\beta_1$, since $\beta_2 = 0$. Thus $\tilde\beta_1$ solves the first-order conditions
$\sum_i (y_i - \exp(x_{1i}'\beta_1))x_{1i} = 0$.

The LR test statistic (7.21) is easily calculated from the fitted log-likelihoods of the
restricted and unrestricted models.

The Wald test statistic for exclusion restrictions from Section 7.2.5 is
$W = -N\hat\beta_2'[\hat A^{22}]^{-1}\hat\beta_2$, where $\hat A^{22}$ is the (2,2) block of $\hat A^{-1}$ and $\hat A = -N^{-1}\sum_i \exp(x_i'\hat\beta)x_i x_i'$.

The LM test is based on $\partial \ln L(\beta)/\partial\beta = \sum_i x_i(y_i - \exp(x_i'\beta))$. At the restricted
MLE this equals $\sum_i x_i \tilde u_i$, where $\tilde u_i = y_i - \exp(x_{1i}'\tilde\beta_1)$ is the residual from estimation
of the restricted model. The LM test statistic (7.25) is


$$LM = \Big[\sum_i x_i \tilde u_i\Big]' \Big[\sum_i \exp(x_{1i}'\tilde\beta_1) x_i x_i'\Big]^{-1} \Big[\sum_i x_i \tilde u_i\Big]. \tag{7.27}$$


Some further simplification is possible since $\sum_i x_{1i}\tilde u_i = 0$ from the first-order
conditions for the restricted MLE given earlier. The LM test here is based on the correlation
between the omitted regressors and the residual, a result that is extended to other
examples in Section 7.3.5.

In general it can be difficult to obtain an algebraic expression for the LM test. For
standard applications of the LM test this has been done and is incorporated into
computer packages. Computation by auxiliary regression may also be possible (see
Section 7.3.5).
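
As a minimal sketch of these calculations, the following standard-library Python code computes the Wald, LR, and LM statistics for a Poisson model with an intercept and one regressor, testing exclusion of that regressor. The dgp, seed, sample size, Newton-Raphson routine, and all function names are assumptions made for illustration; with an intercept-only restricted model the restricted MLE is simply $\tilde\beta_1 = \ln \bar y$.

```python
import math
import random


def poisson_draw(lam, rng):
    """One Poisson(lam) draw by Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def loglik(b1, b2, y, x):
    """Poisson log-likelihood (7.26) with x_i'beta = b1 + b2 * x_i."""
    return sum(-math.exp(b1 + b2 * xi) + yi * (b1 + b2 * xi) - math.lgamma(yi + 1)
               for yi, xi in zip(y, x))


def score_and_info(b1, b2, y, x):
    """Score vector and sum_i exp(x_i'beta) x_i x_i' (equal to -N times A-hat)."""
    g1 = g2 = i11 = i12 = i22 = 0.0
    for yi, xi in zip(y, x):
        m = math.exp(b1 + b2 * xi)
        g1 += yi - m
        g2 += (yi - m) * xi
        i11 += m
        i12 += m * xi
        i22 += m * xi * xi
    return (g1, g2), (i11, i12, i22)


def fit_unrestricted(y, x, iterations=25):
    """Newton-Raphson for the unrestricted MLE, started at the restricted MLE."""
    b1, b2 = math.log(sum(y) / len(y)), 0.0
    for _ in range(iterations):
        (g1, g2), (i11, i12, i22) = score_and_info(b1, b2, y, x)
        det = i11 * i22 - i12 * i12
        b1 += (i22 * g1 - i12 * g2) / det
        b2 += (i11 * g2 - i12 * g1) / det
    return b1, b2


rng = random.Random(12345)
N = 200
x = [rng.gauss(0.0, 1.0) for _ in range(N)]
y = [poisson_draw(math.exp(0.0 + 0.3 * xi), rng) for xi in x]  # hypothetical dgp

b1_u, b2_u = fit_unrestricted(y, x)
b1_r = math.log(sum(y) / N)          # restricted MLE: intercept only, b2 = 0

# LR statistic (7.21).
LR = 2.0 * (loglik(b1_u, b2_u, y, x) - loglik(b1_r, 0.0, y, x))

# Wald statistic: b2 squared over its estimated variance (inverse information).
_, (i11, i12, i22) = score_and_info(b1_u, b2_u, y, x)
W = b2_u ** 2 * (i11 * i22 - i12 * i12) / i11

# LM statistic (7.27): quadratic form in the score at the restricted MLE.
(s1, s2), (j11, j12, j22) = score_and_info(b1_r, 0.0, y, x)
det_r = j11 * j22 - j12 * j12
LM = (j22 * s1 * s1 - 2.0 * j12 * s1 * s2 + j11 * s2 * s2) / det_r

print(f"Wald = {W:.3f}, LR = {LR:.3f}, LM = {LM:.3f}")
```

The first component of the restricted score, `s1`, is zero up to rounding, which is the simplification noted after (7.27): only the omitted regressor contributes to LM.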


7.3.3. Derivation of Tests 


The distribution of the Wald test was formally derived in Section 7.2.4. Proofs for the 
likelihood ratio and Lagrange multiplier tests are more complicated and we merely 
sketch them here. 


Likelihood Ratio Test 


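For simplicity consider the special case where the null hypothesis is $\theta = \bar\theta$, so that
there is no estimation error in $\tilde\theta_r = \bar\theta$. Taking a second-order Taylor series expansion
of $\ln L(\bar\theta)$ about $\ln L(\hat\theta_u)$ yields


$$\ln L(\bar\theta) = \ln L(\hat\theta_u) + \left.\frac{\partial \ln L}{\partial\theta'}\right|_{\hat\theta_u}(\bar\theta - \hat\theta_u) + \frac{1}{2}(\bar\theta - \hat\theta_u)' \left.\frac{\partial^2 \ln L}{\partial\theta\,\partial\theta'}\right|_{\hat\theta_u}(\bar\theta - \hat\theta_u) + R,$$


where $R$ is a remainder term. Since $\partial \ln L/\partial\theta|_{\hat\theta_u} = 0$ by the first-order conditions, this
implies upon rearrangement that


$$-2[\ln L(\bar\theta) - \ln L(\hat\theta_u)] = -(\bar\theta - \hat\theta_u)' \left.\frac{\partial^2 \ln L}{\partial\theta\,\partial\theta'}\right|_{\hat\theta_u}(\bar\theta - \hat\theta_u) + R. \tag{7.28}$$


The right-hand side of (7.28) is $\chi^2(h)$ under $H_0: \theta = \bar\theta$ since by standard results
$\sqrt{N}(\hat\theta_u - \bar\theta) \xrightarrow{d} N[0, -[\operatorname{plim} N^{-1}\partial^2 \ln L/\partial\theta\,\partial\theta']^{-1}]$. For derivation of the limit
distribution of LR in the general case see, for example, Amemiya (1985, p. 143).

A reason for preferring LR is that by the Neyman-Pearson (1933) lemma the
uniformly most powerful test for testing a simple null hypothesis versus simple alternative
hypothesis is a function of the likelihood ratio $L(\tilde\theta_r)/L(\hat\theta_u)$, though not necessarily the
specific function $-2\ln(L(\tilde\theta_r)/L(\hat\theta_u))$ that equals LR given in (7.21) and gives the test
statistic its name.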


LM or Score Test 
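By a first-order Taylor series expansion


$$\frac{1}{\sqrt N}\left.\frac{\partial \ln L}{\partial\theta}\right|_{\tilde\theta_r} = \frac{1}{\sqrt N}\left.\frac{\partial \ln L}{\partial\theta}\right|_{\theta_0} + \frac{1}{N}\frac{\partial^2 \ln L}{\partial\theta\,\partial\theta'}\sqrt N(\tilde\theta_r - \theta_0),$$


and both terms in the right-hand side contribute to the limit distribution. Then the
$\chi^2(h)$ distribution of LM defined in (7.25) follows since it can be shown that


$$R_0 A_0^{-1} \frac{1}{\sqrt N}\left.\frac{\partial \ln L}{\partial\theta}\right|_{\tilde\theta_r} \xrightarrow{d} N[0, R_0 A_0^{-1} B_0 A_0^{-1} R_0'], \tag{7.29}$$

where details are provided in Wooldridge (2002, p. 365), for example, and $R_0$ and $A_0$
are defined in (7.4) and (7.22) and


$$B_0 = \operatorname{plim} N^{-1} \left.\frac{\partial \ln L}{\partial\theta}\frac{\partial \ln L}{\partial\theta'}\right|_{\theta_0}. \tag{7.30}$$

Result (7.29) leads to a chi-square statistic that is much more complicated
than (7.25), but simplification to (7.25) then occurs by the information matrix
equality.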


7.3.4. Which Test? 


Choice of test procedure is usually made based on existence of robust versions, finite- 
sample performance, and ease of computation. 


Asymptotic Equivalence 


All three test statistics are asymptotically distributed as $\chi^2(h)$ under $H_0$. Furthermore,
all three can be shown to be noncentral $\chi^2(h; \lambda)$ distributed with the same
noncentrality parameter under local alternatives. Details are provided for the Wald
test in Section 7.6.3. So the tests all have the same asymptotic power against local
alternatives.

The finite-sample distributions of the three statistics differ. In the linear regression
model with normality, a variant of the Wald test statistic for $h$ linear restrictions on
$\beta$ exactly equals the $F(h, N - K)$ statistic (see Section 7.2.1) whereas no analytical
results exist for the LR and LM statistics. More generally, in nonlinear models exact
small-sample results do not exist.

In some cases an ordering of the values taken by the three test statistics can be
obtained. In particular for tests of linear restrictions in the linear regression model
under normality, Berndt and Savin (1977) showed that Wald $\ge$ LR $\ge$ LM. This result
is of little theoretical consequence, as the test least likely to reject under the null will
have the smallest actual size but also the smallest power. However, it is of practical
consequence for the linear model, as it means when testing at fixed nominal size $\alpha$
that the Wald test will always reject $H_0$ at least as often as the LR test, which in turn will
reject at least as often as the LM test. The Wald test would be preferred by a researcher
determined to reject $H_0$. This result is restricted to linear models.
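
The ordering can be illustrated with a short Python sketch. It relies on the standard expressions for the three statistics in the normal linear model when the error variance is estimated by ML, namely $W = N(RSS_r - RSS_u)/RSS_u$, $LR = N\ln(RSS_r/RSS_u)$, and $LM = N(RSS_r - RSS_u)/RSS_r$; these expressions are not given in the text above and are used here as an assumption, and the simulated data are hypothetical.

```python
import math
import random


def wald_lr_lm(rss_r, rss_u, n):
    """Wald, LR, and LM for linear restrictions in the normal linear model (ML variance)."""
    w = n * (rss_r - rss_u) / rss_u
    lr = n * math.log(rss_r / rss_u)
    lm = n * (rss_r - rss_u) / rss_r
    return w, lr, lm


# Hypothetical data: test exclusion of x from a regression of y on an intercept and x.
rng = random.Random(1)
n = 40
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [1.0 + 0.3 * xi + rng.gauss(0.0, 1.0) for xi in x]

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

rss_r = sum((yi - ybar) ** 2 for yi in y)   # restricted model: intercept only
rss_u = rss_r - sxy ** 2 / sxx              # unrestricted model: intercept and x

W, LR, LM = wald_lr_lm(rss_r, rss_u, n)
print(f"Wald = {W:.4f} >= LR = {LR:.4f} >= LM = {LM:.4f}")
```

The inequality follows from $z \ge \ln(1 + z) \ge z/(1 + z)$ for $z \ge 0$, with $z = (RSS_r - RSS_u)/RSS_u$.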


Invariance to Reparameterization 


The Wald test is not invariant to algebraically equivalent parameterizations of the null
hypothesis (see Section 7.2.9) whereas the LR test is invariant. Some but not all
versions of the LM test are invariant. The LM test is generally invariant if the expected
Hessian (see Section 5.5.2) is used to estimate $A_0$ and not invariant if the Hessian is
used. The test LM* defined later in (7.34) is invariant. The lack of invariance for the
Wald test is a major weakness.




Robust Versions 


In some cases with misspecified density the quasi-MLE (see Section 5.7) remains con- 
sistent. The Wald test is then easily robustified (see Section 7.2). The LM test can be 
robustified with more difficulty; see (7.38) in Section 7.5.1 for a general result for m- 
estimators and Section 8.4 for some robust LM test examples. The LR test is no longer 
chi-square distributed, except in a special case given later in (7.39). Instead, the LR 
test is a mixture of chi-squares (see Section 8.5.3). 


Convenience 


Convenience in computation is also a consideration. LR requires estimation of the
model twice, once with and once without the restrictions of the null hypothesis. If
done by a package, it is easily implemented, as one need only read off the
log-likelihoods routinely printed out, subtract, and multiply by 2. Wald requires estimation
only under $H_a$ and is best to use when the unrestricted model is easy to estimate. For
example, this is the case for restrictions on the parameters of the conditional mean
in nonlinear models such as NLS, probit, Tobit, and logit. The LM statistic requires
estimation only under $H_0$ and is best to use when the restricted model is easy to
estimate. Examples are tests for autocorrelation and heteroskedasticity, where it is easiest
to estimate the null hypothesis model that does not have these complications.

The Wald test is often used for tests of statistical significance whereas the LM test 
is often used for tests of correct model specification. 


7.3.5. Interpretation and Computation of the LM test 


Lagrange multiplier tests have the additional advantages of simple interpretation in 
some leading examples and computation by auxiliary regression. 

In this section attention is restricted to the usual cross-section data case of a scalar
dependent variable independent over $i$, so that $\partial \ln L(\theta)/\partial\theta = \sum_i s_i(\theta)$, where


$$s_i(\theta) = \frac{\partial \ln f(y_i|x_i, \theta)}{\partial\theta} \tag{7.31}$$


is the contribution of the $i$th observation to the score vector of the unrestricted model.
From (7.25) the LM test is a test of the closeness to zero of $\sum_i s_i(\tilde\theta_r)$.


Simple Interpretation of the LM Test 
Suppose that the density is such that $s(\theta)$ factorizes as

$$s(\theta) = g(x, \theta)\, r(y, x, \theta) \tag{7.32}$$


for some $q \times 1$ vector function $g(\cdot)$ and scalar function $r(y, x, \theta)$, the latter of which
may be interpreted as a generalized residual because $y$ appears in $r(\cdot)$ but not $g(\cdot)$. For
example, for Poisson regression $\partial \ln f/\partial\theta = x(y - \exp(x'\beta))$, with $\theta = \beta$.

Given (7.32) and independence over $i$, $\partial \ln L/\partial\theta|_{\tilde\theta_r} = \sum_i \tilde g_i \tilde r_i$, where
$\tilde g_i = g(x_i, \tilde\theta_r)$ and $\tilde r_i = r(y_i, x_i, \tilde\theta_r)$. The LM test can therefore be simply interpreted as
a score test of the correlation between $\tilde g_i$ and the residual $\tilde r_i$. This interpretation was
given in Section 7.3.2 for the LM test with Poisson regression, where $\tilde g_i = x_i$ and
$\tilde r_i = y_i - \exp(x_{1i}'\tilde\beta_1)$.

The partition (7.32) will arise whenever $f(y)$ is based on a one-parameter density.
In particular, many common likelihood models are based on one-parameter LEF
densities, with parameter $\mu$ then modeled as a function of $x$ and $\beta$. In the LEF case
$r(y, x, \theta) = (y - E[y|x])$ (see Section 5.7.3), so the generalized residual $r(\cdot)$ in (7.32)
is then the usual residual.

More generally a partition similar to (7.32) will also arise when $f(y)$ is based on a
two-parameter density, the information matrix is block diagonal in the two parameters,
and the two parameters in turn depend on regressors and parameter vectors $\beta$ and $\alpha$
that are distinct. Then LM tests on $\beta$ are tests of correlation of $\tilde g_{\beta i}$ and $\tilde r_{\beta i}$, where
$s(\beta) = g_\beta(x, \theta)\, r_\beta(y, x, \theta)$, with similar interpretation for LM tests on $\alpha$.

A leading example is linear regression under normality with two parameters $\mu$ and
$\sigma^2$ modeled as $\mu = x'\beta$ and $\sigma^2 = \alpha$ or $\sigma^2 = \sigma^2(z, \alpha)$. For exclusion restrictions in
linear regression under normality, $s_i(\beta) = x_i(y_i - x_i'\beta)$ and the LM test is one of
correlation between regressors $x_i$ and the restricted model residual $\tilde u_i = y_i - x_{1i}'\tilde\beta_1$. For tests
of heteroskedasticity with $\sigma_i^2 = \exp(\alpha_1 + z_i'\alpha_2)$, $s_i(\alpha_2) = \frac{1}{2} z_i\{((y_i - x_i'\beta)^2/\sigma_i^2) - 1\}$,
and the LM test is one of correlation between $z_i$ and the squared residual
$\tilde u_i^2 = (y_i - x_i'\tilde\beta)^2$, since $\sigma_i^2$ is constant under the null hypothesis that $\alpha_2 = 0$.


Outer Product of the Gradient Versions of the LM Test 


Now return to the general $s_i(\theta)$ defined in (7.31). We show in the following that an
asymptotically equivalent version of the LM test statistic (7.25) can be obtained by
running the auxiliary regression or artificial regression


$$1 = \tilde s_i'\gamma + v_i, \tag{7.33}$$

where $\tilde s_i = s_i(\tilde\theta_r)$, and computing

$$LM^* = N R_u^2, \tag{7.34}$$


where $R_u^2$ is the uncentered $R^2$ defined after (7.36). LM* is asymptotically $\chi^2(h)$ under
$H_0$. Equivalently, LM* equals $ESS_u$, the uncentered explained sum of squares (the sum
of squares of the fitted values), or equals $N - RSS$, where RSS is the residual sum of
squares, from regression (7.33).

This result can be easy to implement as in many applications it can be quite simple
to analytically obtain $s_i(\theta)$, generate data for the $q$ components $\tilde s_{1i}, \ldots, \tilde s_{qi}$, and regress
1 on $\tilde s_{1i}, \ldots, \tilde s_{qi}$. Note that here $f(y_i|x_i, \theta)$ in (7.31) is the density of the unrestricted
model.

For the exclusion restrictions in the Poisson model example in Section 7.3.2,
$s_i(\beta) = (y_i - \exp(x_i'\beta))x_i$ and $x_i'\tilde\beta = x_{1i}'\tilde\beta_1$. It follows that LM* can be computed
as $N R_u^2$ from regressing 1 on $(y_i - \exp(x_{1i}'\tilde\beta_1))x_i$, where $x_i$ contains both $x_{1i}$ and $x_{2i}$,
and $\tilde\beta_1$ is obtained from Poisson regression of $y_i$ on $x_{1i}$ alone.

Equations (7.33) and (7.34) require only independence over i. Other auxiliary re- 
gressions are possible if further structure is assumed. In particular, specialize to cases 
where $s(\theta)$ factorizes as in (7.32), and define $r(y, x, \theta)$ so that $V[r(y, x, \theta)] = 1$. Then
an alternative asymptotically equivalent version of the LM test is $N R_u^2$ from regression
of $\tilde r_i$ on $\tilde g_i$. This includes LM tests for linear regression under normality, such as the
Breusch—Pagan LM test for heteroskedasticity. 

These alternative versions of the LM test are called outer-product-of-the-gradient 
versions of the LM test, as they replace $-A_0$ in (7.22) by an outer-product-of-the-gradient
(OPG) estimate or BHHH estimate of $B_0$. Although they are easily computed,
OPG variants of LM tests can have poor small-sample properties with large size distor- 
tions. This has discouraged use of the OPG form of the LM test. These small-sample 
problems can be greatly reduced by bootstrapping (see Section 11.6.3). Davidson and 
MacKinnon (1984) propose double-length auxiliary regressions that also perform bet- 
ter in finite samples. 


Derivation of the OPG Version 


To derive LM*, first note that in (7.25), $\partial \ln L(\theta)/\partial\theta|_{\tilde\theta_r} = \sum_i \tilde s_i$. Second, by the
information matrix equality $A_0 = -B_0$ and, from Section 5.5.2, $B_0$ can be consistently
estimated under $H_0$ by the OPG estimate or BHHH estimate $N^{-1}\sum_i \tilde s_i \tilde s_i'$.
Combining these results gives an asymptotically equivalent version of the LM test
statistic (7.25):


$$LM^* = \Big[\sum_i \tilde s_i\Big]' \Big[\sum_i \tilde s_i \tilde s_i'\Big]^{-1} \Big[\sum_i \tilde s_i\Big]. \tag{7.35}$$


This statistic can be computed from an auxiliary regression of 1 on $\tilde s_i$ as follows.
Define $\tilde S$ to be the $N \times q$ matrix with $i$th row $\tilde s_i'$, and define $l$ to be the $N \times 1$ vector of
ones. Then


$$LM^* = l'\tilde S[\tilde S'\tilde S]^{-1}\tilde S' l = ESS_u = N R_u^2. \tag{7.36}$$


In general for regression of $y$ on $X$ the uncentered explained sum of squares ($ESS_u$)
is $y'X(X'X)^{-1}X'y$, which is exactly of the form (7.36), whereas the uncentered $R^2$ is
$R_u^2 = y'X(X'X)^{-1}X'y/y'y$, which here is (7.36) divided by $l'l = N$. The term
uncentered is used because in $R_u^2$ division is by the sum of squared deviations of $y$ around
zero rather than around the sample mean.
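
The auxiliary-regression computation in (7.33) to (7.36) can be sketched in standard-library Python for a Poisson model with an intercept and one regressor whose exclusion is tested. The twelve observations below are made-up numbers used purely for illustration, and the variable names are not from any package.

```python
# Hypothetical count data and a single regressor.
y = [0, 1, 3, 0, 2, 1, 4, 0, 1, 2, 5, 1]
x = [-1.2, -0.3, 0.8, -0.9, 0.4, 0.1, 1.5, -1.6, -0.2, 0.6, 1.9, 0.0]
N = len(y)

# Restricted Poisson MLE with an intercept only: exp(b1) equals the sample mean.
ybar = sum(y) / N
u = [yi - ybar for yi in y]                  # restricted-model residuals

# Rows of S: unrestricted-model scores evaluated at the restricted MLE.
S = [(ui, ui * xi) for ui, xi in zip(u, x)]

# Auxiliary regression of 1 on the two score components (no intercept).
a = sum(s1 * s1 for s1, _ in S)              # elements of S'S
b = sum(s1 * s2 for s1, s2 in S)
c = sum(s2 * s2 for _, s2 in S)
t1 = sum(s1 for s1, _ in S)                  # elements of S'l
t2 = sum(s2 for _, s2 in S)
det = a * c - b * b
g1 = (c * t1 - b * t2) / det                 # OLS coefficients (S'S)^{-1} S'l
g2 = (a * t2 - b * t1) / det

fitted = [g1 * s1 + g2 * s2 for s1, s2 in S]
ESS_u = sum(f * f for f in fitted)           # uncentered explained sum of squares
RSS = sum((1.0 - f) ** 2 for f in fitted)
R2_u = ESS_u / N                             # uncentered R^2, since l'l = N

LM_star = N * R2_u
LM_quadratic = t1 * g1 + t2 * g2             # l'S (S'S)^{-1} S'l, as in (7.36)
print(LM_star, LM_quadratic, N - RSS)
```

The three printed numbers coincide up to rounding, matching the equalities $LM^* = N R_u^2 = ESS_u = N - RSS$.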


7.4. Example: Likelihood-Based Hypothesis Tests 

The various test procedures (Wald, LR, and LM) are illustrated using generated data
from the dgp $y|x$ Poisson distributed with mean $\exp(\beta_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4)$, where
$\beta_1 = 0$ and $\beta_2 = \beta_3 = \beta_4 = 0.1$ and the three regressors are iid draws from $N[0, 1]$.


Table 7.1. Test Statistics for Poisson Regression Example

                                            Test Statistic                           Result
Null Hypothesis                      Wald      LR        LM        LM*       ln L        at level 0.05

$H_{10}: \beta_3 = 0$                5.904     5.754     5.916     6.218     -241.648    Reject
                                     (0.015)   (0.016)   (0.015)   (0.013)

$H_{20}: \beta_3 = 0, \beta_4 = 0$   8.570     8.302     8.575     9.186     -242.922    Reject
                                     (0.014)   (0.016)   (0.014)   (0.010)

$H_{30}: \beta_3 - \beta_4 = 0$      0.293     0.293     0.293     0.315     -238.918    Do not reject
                                     (0.588)   (0.589)   (0.588)   (0.575)

$H_{40}: \beta_3/\beta_4 - 1 = 0$    0.158     0.293     0.293     0.315     -238.918    Do not reject
                                     (0.691)   (0.589)   (0.588)   (0.575)

Note: The dgp for $y$ is the Poisson distribution with parameter $\exp(0.0 + 0.1x_2 + 0.1x_3 + 0.1x_4)$ and sample size
$N = 200$. Test statistics are given with associated p-values in parentheses. Tests of the second hypothesis are
$\chi^2(2)$ and the other tests are $\chi^2(1)$ distributed. Log-likelihoods for restricted ML estimation are also given; the
log-likelihood in the unrestricted model is -238.772.
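
Some entries of Table 7.1 can be reproduced from the reported numbers alone. The Python sketch below recomputes the LR statistic for the first hypothesis from the two log-likelihoods and converts reported statistics to p-values. The helper `chi2_pvalue` is an illustrative function that uses the closed-form chi-square tail probabilities available for one and two degrees of freedom only.

```python
import math


def chi2_pvalue(stat, df):
    """Upper-tail chi-square probability; closed forms for df = 1 and df = 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        return math.exp(-stat / 2.0)
    raise ValueError("this sketch handles df = 1 or df = 2 only")


# Log-likelihoods reported in Table 7.1 and its note.
lnL_unrestricted = -238.772
lnL_restricted_H10 = -241.648

# LR statistic (7.21) for the first hypothesis.
LR_H10 = 2.0 * (lnL_unrestricted - lnL_restricted_H10)

p_wald_H10 = chi2_pvalue(5.904, 1)   # Wald statistic for the first hypothesis
p_wald_H20 = chi2_pvalue(8.570, 2)   # Wald statistic for the joint hypothesis
p_lr_H10 = chi2_pvalue(LR_H10, 1)

print(f"LR = {LR_H10:.3f}, p = {p_lr_H10:.3f}")
print(f"Wald p-values: {p_wald_H10:.3f} and {p_wald_H20:.3f}")
```

The small gap between the recomputed LR statistic and the tabulated 5.754 reflects rounding of the reported log-likelihoods.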


Poisson regression of y on an intercept, x2, x3, and x4 for a generated sample of size
200 yielded unrestricted MLE


$$\hat{E}[y|x] = \exp(\underset{(-2.14)}{-0.165} - \underset{(-0.36)}{0.028}\,x_2 + \underset{(2.43)}{0.163}\,x_3 + \underset{(0.08)}{0.103}\,x_4),$$

where associated t-statistics are given in parentheses and the unrestricted
log-likelihood is -238.772.

The analysis tests four different hypotheses, detailed in the first column of Table 7.1. 
The estimator is nonlinear, whereas the hypotheses are examples of, respectively, sin- 
gle exclusion restriction, multiple exclusion restriction, linear restrictions, and nonlin- 
ear restrictions. The remainder of the table gives four asymptotically equivalent test 
statistics of these hypotheses and their associated p-values. For this sample all tests re- 
ject the first two hypotheses and do not reject the remaining two, at significance level 
0.05. 

The Wald test statistic is computed using (7.23). This requires estimation of the
unrestricted model, given previously, to obtain the variance matrix estimate of the
unrestricted MLE. Wald tests of different hypotheses then require computation of different
$\hat{h}$ and $\hat{R}$ and simplify in some cases. The Wald chi-square test of the single
exclusion restriction is just the square of the usual t-test, with $2.43^2 \simeq 5.90$. The Wald
test statistic of the joint exclusion restrictions is detailed in Section 7.2.5. Here $x_3$ is
statistically significant and $x_4$ is statistically insignificant, whereas jointly $x_3$ and
$x_4$ are statistically significant at level 0.05. The Wald test for the third hypothesis is
given in (7.19) and leads to nonrejection. The third and fourth hypotheses are equivalent,
since $\beta_3/\beta_4 - 1 = 0$ implies $\beta_3 = \beta_4$, but the Wald test statistic for the fourth
hypothesis, given in (7.13), differs from (7.19). The statistic (7.13) was calculated using
matrix operations, as most packages will at best calculate Wald tests of linear hypotheses.
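
As a small illustration of the single exclusion restriction, the sketch below squares the reported t-statistic and converts it to a chi-square(1) p-value. The function name `wald_single` is hypothetical, and the input 2.43 is the rounded t-statistic, so the result matches the tabulated 5.904 only up to rounding.

```python
from statistics import NormalDist

def wald_single(t_stat):
    """Wald chi-square statistic and p-value for a single restriction."""
    w = t_stat ** 2
    # With one degree of freedom, Pr[chi2(1) > w] = 2 * (1 - Phi(|t|))
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))
    return w, p_value

w, p = wald_single(2.43)   # t-statistic on x3 from the unrestricted MLE
```

The p-value is close to the 0.015 reported for the first hypothesis in Table 7.1.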

The LR test statistic is especially easy to compute, using (7.21), given estimation
of the restricted model. For the first three hypotheses the restricted model is
estimated by Poisson regression of y on, respectively, regressors (1, x2, x4), (1, x2), and
(1, x2, x3 + x4), where the third regression uses $\beta_3 x_3 + \beta_4 x_4 = \beta_3(x_3 + x_4)$ if
$\beta_3 = \beta_4$. As an example of the LR test, for the second hypothesis LR = -2[-242.922 -
(-238.772)] = 8.30, that is, minus two times the restricted less the unrestricted
log-likelihood. The fourth restricted model in theory requires ML estimation
subject to nonlinear constraints on the parameters, which few packages do. However,
constrained ML estimation is invariant to the way the restrictions are expressed, so
here the same estimates are obtained as for the third restricted model, leading to the
same LR test statistic.
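
The LR statistics can be recomputed from the log-likelihoods reported in Table 7.1. This is a minimal sketch with hypothetical names (`lr_test`, `chi2_sf`); the upper-tail chi-square probability is coded only for one and two degrees of freedom, where closed forms exist. Because the log-likelihoods are rounded to three decimals, the results differ from the tabulated LR values in the third decimal place.

```python
import math
from statistics import NormalDist

LNL_UNRESTRICTED = -238.772
LNL_RESTRICTED = {          # hypothesis: (restricted ln L, number of restrictions)
    "beta3 = 0": (-241.648, 1),
    "beta3 = beta4 = 0": (-242.922, 2),
    "beta3 = beta4": (-238.918, 1),
}

def chi2_sf(x, df):
    """Upper-tail chi-square probability; closed forms for df = 1 and 2 only."""
    if df == 1:
        return 2.0 * (1.0 - NormalDist().cdf(math.sqrt(x)))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 1 or 2 is handled in this sketch")

def lr_test(lnl_restricted, lnl_unrestricted, df):
    """LR = -2 * (restricted ln L - unrestricted ln L) and its p-value."""
    lr = -2.0 * (lnl_restricted - lnl_unrestricted)
    return lr, chi2_sf(lr, df)

results = {name: lr_test(lnl, LNL_UNRESTRICTED, df)
           for name, (lnl, df) in LNL_RESTRICTED.items()}
```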

The LM test statistic is computed using (7.25), which for the Poisson model
specializes to (7.27). This statistic is computed using matrix commands, with different
restrictions leading to the different restricted ML estimates $\tilde{\theta}$. As for the LR test,
the LM test is invariant to transformations, so the LM tests of the third and fourth
hypotheses are equivalent.

An asymptotically equivalent version of the LM test statistic is the statistic
LM* given in (7.35). This can be computed as the explained sum of squares
from the auxiliary regression (7.33). For the Poisson model
$s_{ji} = \partial \ln f(y_i)/\partial\beta_j = (y_i - \exp(x_i'\beta))x_{ji}$, with evaluation at the
appropriate restricted MLE for the hypothesis under consideration. The statistic LM* is
simpler to compute than LM, though like LM it requires restricted ML estimates.
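
A simplified version of this calculation is sketched below for a Poisson model with an intercept and one regressor, testing exclusion of that regressor. This is an illustrative setup, not the four-regressor example of Table 7.1: the restricted MLE is then just the sample mean of y, so no iterative estimation is needed. The names `poisson_draw` and `lm_star_exclusion` and the seed are assumptions of the sketch.

```python
import math
import random

def poisson_draw(rng, mean):
    """Poisson draw by Knuth's multiplication method (fine for small means)."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def lm_star_exclusion(y, x):
    """LM* test that the slope on x is zero in a Poisson model with mean
    exp(b1 + b2 * x). Under the restriction the MLE of exp(b1) is mean(y)."""
    n = len(y)
    mu = sum(y) / n
    # Scores (y_i - mu) * (1, x_i), evaluated at the restricted MLE
    s1 = [yi - mu for yi in y]
    s2 = [(yi - mu) * xi for yi, xi in zip(y, x)]
    a11 = sum(v * v for v in s1)
    a12 = sum(u * v for u, v in zip(s1, s2))
    a22 = sum(v * v for v in s2)
    b1, b2 = sum(s1), sum(s2)
    det = a11 * a22 - a12 * a12
    # l'S (S'S)^{-1} S'l written out for the 2 x 2 case
    return (b1 * b1 * a22 - 2.0 * b1 * b2 * a12 + b2 * b2 * a11) / det

rng = random.Random(12345)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]
y = [poisson_draw(rng, math.exp(0.1 * xi)) for xi in x]
lm_star = lm_star_exclusion(y, x)
reject = lm_star > 3.841   # 5% critical value of chi-square(1)
```

With two observations y = (0, 2) and x = (-1, 1) the auxiliary regression fits perfectly, so the function returns N = 2, which is a convenient hand check.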

In this example with generated data the various test statistics are very similar. This 
is not always the case. In particular, the test statistic LM* can have poorer finite-sample 
size properties than LM, even if the dgp is known. Also, in applications with real data 
the dgp is unlikely to be perfectly specified, leading to divergence of the various test 
statistics even in infinitely large samples. 


7.5. Tests in Non-ML Settings 


The Wald test is the standard test to use in non-ML settings. From Section 7.2 it is a 
general testing procedure that can always be implemented, using an appropriate sand- 
wich estimator of the variance matrix of the parameter estimates. The only limitation 
is that in some applications unrestricted estimation may be much more difficult to 
perform than restricted estimation. 

The LM or score test, based on departures from zero of the gradient vector of the 
unrestricted model evaluated at the restricted estimates, can also be generalized to 
non-ML estimators. The form of the LM test, however, is usually considerably more 
complicated than in the ML case. Moreover, the simplest forms of the LM test statistic 
based on auxiliary regressions are usually not robust to distributional misspecification. 

The LR test is based on the difference between the maximized values of the objec- 
tive function with and without restrictions imposed. This usually does not generalize 
to objective functions other than the likelihood function, as this difference is usually 
not chi-square distributed. 

For completeness we provide a condensed presentation of extension of the ML tests 
to m-estimators and to efficient GMM estimators. As already noted, in most applica- 
tions use of the simpler Wald test is sufficient. 




7.5.1. Tests Based on m-Estimators 


Tests for m-estimators are straightforward extensions of those for ML estimators,
except that it is no longer possible to use the information matrix equality to simplify the
test statistics and the LR test generalizes in only very special cases. The resulting test
statistics are asymptotically $\chi^2(h)$ distributed under $H_0: h(\theta) = 0$ and have the same
noncentral chi-square distribution under local alternatives.

Consider m-estimators that maximize $Q_N(\theta) = N^{-1}\sum_i q_i(\theta)$ with first-order
conditions $N^{-1}\sum_i s_i(\theta) = 0$. Define the $q \times q$ matrices
$A(\theta) = N^{-1}\sum_i \partial s_i(\theta)/\partial\theta'$ and $B(\theta) = N^{-1}\sum_i s_i(\theta)s_i(\theta)'$ and the
$h \times q$ matrix $R(\theta) = \partial h(\theta)/\partial\theta'$. Let $\hat{\theta}_u$ and $\tilde{\theta}_r$ denote unrestricted
and restricted estimators, respectively, and let $\hat{A} = A(\hat{\theta}_u)$ and $\tilde{A} = A(\tilde{\theta}_r)$ with
similar notation for B and R. Finally, let $\hat{h} = h(\hat{\theta}_u)$ and $\tilde{s}_i = s_i(\tilde{\theta}_r)$.

The Wald test statistic is based on closeness of $\hat{h}$ to zero. Here


$$W = \hat{h}'\big[\hat{R} N^{-1}\hat{A}^{-1}\hat{B}\hat{A}^{-1}\hat{R}'\big]^{-1}\hat{h}, \tag{7.37}$$


since from Section 5.5.1 the robust variance matrix estimate for $\hat{\theta}_u$ is
$N^{-1}\hat{A}^{-1}\hat{B}\hat{A}^{-1}$. Packages with the option of robust standard errors use this more
general form to compute Wald tests of statistical significance.
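
To make the sandwich form concrete, the sketch below specializes (7.37) to a scalar case: least squares through the origin viewed as an m-estimator with $s_i(\theta) = x_i(y_i - x_i\theta)$, and a test of $H_0: \theta = \theta_0$ so that R = 1. The data-generating process, seed, and function name are illustrative assumptions.

```python
import random

def robust_wald_scalar(y, x, theta0):
    """Robust Wald test of H0: theta = theta0 for least squares through the
    origin, treated as an m-estimator with s_i = x_i * (y_i - x_i * theta)."""
    n = len(y)
    theta_hat = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    scores = [xi * (yi - xi * theta_hat) for xi, yi in zip(x, y)]
    a_hat = -sum(xi * xi for xi in x) / n       # average derivative of s_i
    b_hat = sum(s * s for s in scores) / n      # average squared score
    var_hat = b_hat / (a_hat * a_hat) / n       # N^{-1} A^{-1} B A^{-1}
    h_hat = theta_hat - theta0                  # h(theta) = theta - theta0, R = 1
    return h_hat * h_hat / var_hat, theta_hat, var_hat

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(500)]
# Heteroskedastic errors: the error standard deviation grows with |x|
y = [1.0 * xi + abs(xi) * rng.gauss(0.0, 1.0) for xi in x]
w_true, theta_hat, var_hat = robust_wald_scalar(y, x, theta0=1.0)
w_false, _, _ = robust_wald_scalar(y, x, theta0=0.0)
```

The statistic `w_true` tests the value used to generate the data, whereas `w_false` tests a false null and is far larger.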

Let $g(\theta) = \partial Q_N(\theta)/\partial\theta$ denote the gradient vector, and let
$\tilde{g} = g(\tilde{\theta}_r) = N^{-1}\sum_i \tilde{s}_i$. The LM test statistic is based on the closeness of
$\tilde{g}$ to 0 and is given by


$$LM = N\tilde{g}'\tilde{A}^{-1}\tilde{R}'\big[\tilde{R}\tilde{A}^{-1}\tilde{B}\tilde{A}^{-1}\tilde{R}'\big]^{-1}\tilde{R}\tilde{A}^{-1}\tilde{g}, \tag{7.38}$$


a result obtained by forming a chi-square test statistic based on (7.29), where $N\tilde{g}$
replaces $\partial \ln L/\partial\theta|_{\tilde{\theta}_r}$. This test is clearly not as simple to implement as a robust
Wald test. Some examples of computation of the robust form of LM tests are given in
Section 8.4. The standard implementations of LM tests in computer packages are often
not robust versions of the LM test.

The LR test does not generalize easily. It does generalize to m-estimators if
$B_0 = -\alpha A_0$ for some scalar $\alpha$, a weaker version of the IM equality. In such special
cases the quasi-likelihood ratio (QLR) test statistic is


$$QLR = -2N\big[Q_N(\tilde{\theta}_r) - Q_N(\hat{\theta}_u)\big]/\hat{\alpha}_u, \tag{7.39}$$


where $\hat{\alpha}_u$ is a consistent estimate of $\alpha$ obtained from unrestricted estimation (see
Wooldridge, 2002, p. 370). The condition $B_0 = -\alpha A_0$ holds for generalized linear
models (see Section 5.7.4). Then the statistic QLR is equivalent to the difference of
deviances for the restricted and unrestricted models, a generalization of the F-test based
on the difference between restricted and unrestricted sum of squared residuals for OLS
and NLS estimation with homoskedastic errors. For general quasi-ML estimation, with
$B_0 \neq -\alpha A_0$, the LR test statistic can be distributed as a weighted sum of chi-squares
(see Section 8.5.3).
7.5.2. Tests Based on Efficient GMM Estimators


For GMM the various test statistics are simplest for efficient GMM, meaning GMM 
estimation using the optimal weighting matrix. This poses no great practical restriction 
as the optimal weighting matrix can always be estimated, as detailed in Section 6.3.5. 

Consider GMM estimation based on the moment condition $E[m_i(\theta)] = 0$. (Note
the change in notation from Chapter 6: $h(\theta)$ is being used in the current chapter to
denote the restrictions under $H_0$.) Using the notation introduced in Section 6.3.5, the
efficient unrestricted GMM estimator $\hat{\theta}_u$ minimizes $Q_N(\theta) = g_N(\theta)'S_N^{-1}g_N(\theta)$, where
$g_N(\theta) = N^{-1}\sum_i m_i(\theta)$ and $S_N$ is consistent for $S_0$, the asymptotic variance of
$\sqrt{N}g_N(\theta_0)$. The restricted GMM estimator $\tilde{\theta}_r$ is assumed to minimize $Q_N(\theta)$ with
the same weighting matrix $S_N^{-1}$, subject to the restriction $h(\theta) = 0$.

The three following test statistics, summarized by Newey and West (1987a), are
asymptotically $\chi^2(h)$ distributed under $H_0: h(\theta) = 0$ and have the same noncentral
chi-square distribution under local alternatives.

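The Wald test statistic as usual is based on closeness of $\hat{h}$ to zero. This yields


$$W = \hat{h}'\big[\hat{R} N^{-1}(\hat{G}'\hat{S}^{-1}\hat{G})^{-1}\hat{R}'\big]^{-1}\hat{h}, \tag{7.40}$$


since the variance of the efficient GMM estimator is $N^{-1}(G'S^{-1}G)^{-1}$ from Section
6.3.5, where $G_N(\theta) = \partial g_N(\theta)/\partial\theta'$ and the caret denotes evaluation at $\hat{\theta}_u$.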

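The first-order conditions of efficient GMM are $\hat{G}'\hat{S}^{-1}\hat{g} = 0$. The LM statistic tests
whether this gradient vector is close to zero when instead evaluated at $\tilde{\theta}_r$, leading to


$$LM = N\tilde{g}'\tilde{S}^{-1}\tilde{G}(\tilde{G}'\tilde{S}^{-1}\tilde{G})^{-1}\tilde{G}'\tilde{S}^{-1}\tilde{g}, \tag{7.41}$$


where the tilde denotes evaluation at $\tilde{\theta}_r$ and we use the Section 6.3.3 assumption that
$\sqrt{N}g_N(\theta_0) \overset{d}{\to} N[0, S_0]$, so $\sqrt{N}\tilde{G}'\tilde{S}^{-1}\tilde{g} \overset{d}{\to} N[0, \mathrm{plim}\, G'S^{-1}G]$.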

For the efficient GMM estimator the values of the objective function at the restricted
and unrestricted estimates can also be compared, leading to the difference test statistic


$$D = N\big[Q_N(\tilde{\theta}_r) - Q_N(\hat{\theta}_u)\big]. \tag{7.42}$$


Like W and LM, the statistic D is asymptotically $\chi^2(h)$ distributed under
$H_0: h(\theta) = 0$.
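
The three statistics can be compared in a deliberately simple overidentified model. The sketch below assumes two series that share a common mean $\theta$, so that $m_i(\theta) = (y_{1i} - \theta, y_{2i} - \theta)'$ and $G = -(1, 1)'$, and it uses one fixed weighting matrix for all three statistics. Under those two assumptions (moments linear in $\theta$, a single $S_N$) the objective is exactly quadratic and W, LM, and D coincide numerically; in general they are only asymptotically equivalent. All names and data are illustrative.

```python
import random

def efficient_gmm_tests(y1, y2, theta0):
    """W, LM, and D for H0: theta = theta0 when both series have mean theta,
    so m_i(theta) = (y1_i - theta, y2_i - theta)' and G = -(1, 1)'."""
    n = len(y1)
    m1, m2 = sum(y1) / n, sum(y2) / n
    # S_N: second moments of m_i, centered at the sample means
    s11 = sum((a - m1) ** 2 for a in y1) / n
    s22 = sum((b - m2) ** 2 for b in y2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(y1, y2)) / n
    det = s11 * s22 - s12 * s12
    i11, i12, i22 = s22 / det, -s12 / det, s11 / det   # elements of S_N^{-1}

    def q(theta):                                      # g_N' S_N^{-1} g_N
        g1, g2 = m1 - theta, m2 - theta
        return g1 * g1 * i11 + 2.0 * g1 * g2 * i12 + g2 * g2 * i22

    gsg = i11 + 2.0 * i12 + i22                        # G' S^{-1} G
    theta_hat = ((i11 + i12) * m1 + (i12 + i22) * m2) / gsg
    wald = n * (theta_hat - theta0) ** 2 * gsg         # (7.40) with R = 1
    gs_g = -((i11 + i12) * (m1 - theta0) + (i12 + i22) * (m2 - theta0))
    lm = n * gs_g ** 2 / gsg                           # (7.41)
    diff = n * (q(theta0) - q(theta_hat))              # (7.42)
    return wald, lm, diff

rng = random.Random(3)
e = [rng.gauss(0.0, 1.0) for _ in range(400)]
y1 = [2.0 + ei for ei in e]
y2 = [2.0 + 0.5 * ei + rng.gauss(0.0, 1.0) for ei in e]
wald, lm, diff = efficient_gmm_tests(y1, y2, theta0=2.0)
```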

Even in the likelihood case, this last statistic differs from the LR statistic
because it uses a different objective function. The MLE minimizes
$Q_N(\theta) = -N^{-1}\sum_i \ln f(y_i|\theta)$. From Section 6.3.7, the asymptotically equivalent
efficient GMM estimator instead minimizes the quadratic form
$Q_N(\theta) = N^{-1}\big(\sum_i s_i(\theta)\big)'\big(\sum_i s_i(\theta)s_i(\theta)'\big)^{-1}\big(\sum_i s_i(\theta)\big)$,
where $s_i(\theta) = \partial \ln f(y_i|\theta)/\partial\theta$. The statistic D can be used in general, provided the
GMM estimator used is the efficient GMM estimator, whereas the LR test can only be
generalized for some special cases of m-estimators mentioned after (7.39).

For MM estimators, that is, in the just-identified GMM model, $D = LM = N Q_N(\tilde{\theta}_r)$,
so the LM and difference tests are equivalent. For D this simplification occurs because
$g_N(\hat{\theta}_u) = 0$ and so $Q_N(\hat{\theta}_u) = 0$. For LM simplification occurs in (7.41)
as then $\tilde{G}_N$ is invertible.
7.6. Power and Size of Tests


The remaining sections of this chapter study two limitations in using the usual com- 
puter output to test hypotheses. 

First, a test can have little ability to discriminate between the null and alternative 
hypotheses. Then the test has low power, meaning there is a low probability of rejecting 
the null hypothesis when it is false. Standard computer output does not calculate test 
power, but it can be evaluated using asymptotic methods (see this section) or finite- 
sample Monte Carlo methods (see Section 7.7). If a major contribution of an empirical 
paper is the rejection or nonrejection of a particular hypothesis, there is no reason for 
the paper not to additionally present the power of the test against some meaningful 
alternative hypothesis. 

Second, the true size of the test may differ substantially from the nominal size of 
the test obtained from asymptotic theory. The rule of thumb that sample size N > 30 
is sufficient for asymptotic theory to provide a good approximation for inference on a 
single variable does not extend to models with regressors. Poor approximation is most 
likely in the tails of the approximating distribution, but the tails are used to obtain 
critical values of tests at common significance levels such as 5%. In practice the critical 
value for a test statistic obtained from large-sample approximation is often smaller 
than the correct critical value based on the unknown true distribution. Small-sample 
refinements are attempts to get closer to the exact critical value. For linear regression 
under normality exact critical values can be obtained, using the t rather than z and the
F rather than χ² distribution, but similar results are not exact for nonlinear regression.
Instead, small-sample refinements may be obtained through Monte Carlo methods (see 
Section 7.7) or by use of the bootstrap (see Section 7.8 and Chapter 11). 

With modern computers it is relatively easy to correct the size and investigate the 
power of tests used in an applied study. We present this neglected topic in some 
detail. 


7.6.1. Test Size and Power 


Hypothesis tests lead to either rejection or nonrejection of the null hypothesis. Correct
decisions are made if $H_0$ is rejected when $H_0$ is false or if $H_0$ is not rejected when $H_0$
is true.

There are also two possible incorrect decisions: (1) rejecting $H_0$ when $H_0$ is true,
called a type I error, and (2) nonrejection of $H_0$ when $H_0$ is false, called a type II
error. Ideally the probabilities of both errors will be low, but in practice decreasing
the probability of one type of error comes at the expense of increasing the probability
of the other. The classical hypothesis testing solution is to fix the probability of a type
I error at a particular level, usually 0.05, while leaving the probability of a type II error
unspecified.

Define the size of a test or significance level


$$\alpha = \Pr[\text{type I error}] = \Pr[\text{reject } H_0 \mid H_0 \text{ true}], \tag{7.43}$$


with common choices of $\alpha$ being 0.01, 0.05, or 0.10. A hypothesis is rejected if the test
statistic falls into a rejection region defined so that the test significance level equals the
specified value of $\alpha$. A closely related equivalent method computes the p-value of a
test, the marginal significance level at which the null hypothesis is just rejected, and
rejects $H_0$ if the p-value is less than the specified value of $\alpha$. Both methods require only
knowledge of the distribution of the test statistic under the null hypothesis, presented
in Section 7.2 for the Wald test statistic.

Consideration should also be given to the probability of a type II error. The power 
of a test is defined to be 


$$\begin{aligned} \text{Power} &= \Pr[\text{reject } H_0 \mid H_a \text{ true}] \\ &= 1 - \Pr[\text{accept } H_0 \mid H_a \text{ true}] \\ &= 1 - \Pr[\text{type II error}]. \end{aligned} \tag{7.44}$$


Ideally, test power is close to one since then the probability of a type II error is close to
zero. Determining the power requires knowledge of the distribution of the test statistic
under $H_a$.

Analysis of test power is typically ignored in empirical work, except that test
procedures are usually chosen to be ones that are known theoretically to have power that, for
given level $\alpha$, is high relative to other alternative test statistics. Ideally, the uniformly
most powerful (UMP) test is used. This is the test that has the greatest power, for given
level $\alpha$, for all alternative hypotheses. UMP tests do exist when testing a simple null
hypothesis against a simple alternative hypothesis. Then the Neyman-Pearson lemma
gives the result that the UMP test is a function of the likelihood ratio. For more
general testing situations involving composite hypotheses there is usually no UMP test,
and further restrictions are placed such as UMP one-sided tests. In practice, power
considerations are left to theoretical econometricians who use theory and simulations
applied to various testing procedures to suggest which testing procedures are the most
powerful.

It is nonetheless possible to determine test power in any given application. In the 
following we detail how to compute the asymptotic power of the Wald test, which 
equals that of the LR and LM tests in the fully parametric case. 


7.6.2. Local Alternative Hypotheses 


Since power is the probability of rejecting $H_0$ when $H_a$ is true, the computation
of power requires obtaining the distribution of the test statistic under the alternative
hypothesis. For a Wald chi-square test at significance level $\alpha$ the power equals
$\Pr[W > \chi^2_\alpha(h) \mid H_a]$. Calculation of this probability requires specification of a
particular alternative hypothesis, because $H_a: h(\theta) \neq 0$ is very broad.

The obvious choice is the fixed alternative $h(\theta) = \delta$, where $\delta$ is an $h \times 1$ finite
vector of nonzero constants. The quantity $\delta$ is sometimes referred to as the hypothesis
error, and larger hypothesis errors lead to greater power. For a fixed alternative
the Wald test statistic asymptotically has power one as it rejects the null hypothesis
all the time. To see this note that if $h(\theta) = \delta$ then the Wald test statistic becomes
infinite, since

$$W = \hat{h}'(\hat{R} N^{-1}\hat{C}\hat{R}')^{-1}\hat{h} \overset{p}{\to} \delta'(R_0 N^{-1} C_0 R_0')^{-1}\delta,$$

using $\hat{\theta} \overset{p}{\to} \theta_0$, so $\hat{h} = h(\hat{\theta}) \overset{p}{\to} h(\theta_0) = \delta$, and $\hat{C} \overset{p}{\to} C_0$. It follows that
$W \overset{p}{\to} \infty$ since all the terms except N are finite and nonzero. This infinite value leads
to $H_0$ being always rejected, as it should be, and hence having perfect power of one.

The Wald test statistic is therefore a consistent test statistic, that is, one whose
power goes to one as $N \to \infty$. Many test statistics are consistent, just as many
estimators are consistent. More stringent criteria are needed to discriminate among the test
statistics, just as relative efficiency is used to choose among estimators.

For estimators that are root-N consistent, we consider a sequence of local
alternatives


$$H_a: h(\theta) = \delta/\sqrt{N}, \tag{7.45}$$


where $\delta$ is a vector of fixed constants with $\delta \neq 0$. This sequence of alternative
hypotheses, called Pitman drift, gets closer to the null hypothesis value of zero as the
sample size gets larger, at the same rate $\sqrt{N}$ as used to scale up $\hat{\theta}$ to get a
nondegenerate distribution for the consistent estimator. The alternative hypothesis value of
$h(\theta)$ therefore moves toward zero at a rate that negates any improved efficiency with
increased sample size. For a much more detailed account of local alternatives and
related literatures see McManus (1991).
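
The contrast between fixed and local alternatives can be seen with simple arithmetic in the scalar case, where the limiting expression for the Wald statistic is $N h^2/c$. The sketch below uses illustrative values of $\delta$ and $c$ and a hypothetical helper name; it evaluates the nonrandom limit expression only and does not simulate W.

```python
def wald_limit(h_value, c, n):
    """N * h^2 / c: the scalar version of the limit expression for W, where
    c is the limit variance of sqrt(N) times the estimator."""
    return n * h_value ** 2 / c

delta, c = 0.5, 4.0
for n in (100, 10_000, 1_000_000):
    fixed = wald_limit(delta, c, n)               # fixed alternative: h = delta
    local = wald_limit(delta / n ** 0.5, c, n)    # local alternative: h = delta / sqrt(N)
    print(n, fixed, local)
```

Under the fixed alternative the expression grows in proportion to N, whereas under Pitman drift it stays at $\delta^2/c$ for every N, which is why a nondegenerate limit distribution is obtained.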


7.6.3. Asymptotic Power of the Wald Test 


Under the sequence of local alternatives (7.45) the Wald test statistic has a
nondegenerate distribution, the noncentral chi-square distribution. This permits determination
of the power of the Wald test.

Specifically, as is shown in Section 7.6.4, under $H_a$ the Wald statistic W defined in
(7.6) is asymptotically $\chi^2(h; \lambda)$ distributed, where $\chi^2(h; \lambda)$ denotes the noncentral
chi-square distribution with noncentrality parameter


$$\lambda = \tfrac{1}{2}\delta'(R_0 C_0 R_0')^{-1}\delta, \tag{7.46}$$

and $R_0$ and $C_0$ are defined in (7.4) and (7.5). The power of the Wald test, the
probability of rejecting $H_0$ given the local alternative $H_a$ is true, is therefore


$$\text{Power} = \Pr[W > \chi^2_\alpha(h) \mid W \sim \chi^2(h; \lambda)]. \tag{7.47}$$


Figure 7.1 plots power against $\lambda$ for tests of a scalar hypothesis (h = 1) at the
commonly used sizes or significance levels of 10%, 5%, and 1%. For $\lambda$ close to zero the
power equals the size, and for large $\lambda$ the power goes to one.
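
For a single restriction, (7.47) can be evaluated with the normal cdf alone. The sketch below takes the definition in (7.46) literally: with its factor of one half, W behaves like the square of a normal variate with mean $\sqrt{2\lambda}$ and unit variance. Software that defines the noncentrality parameter without the one half would be given $2\lambda$ instead, so numbers from this sketch depend on that convention and are not guaranteed to match values read off a plotted curve. The function name is hypothetical.

```python
from statistics import NormalDist

_STD = NormalDist()

def wald_power_scalar(lam, alpha):
    """Asymptotic power of a one-restriction Wald test at size alpha, with
    lam defined as in (7.46), that is, including the factor of one half."""
    crit = _STD.inv_cdf(1.0 - alpha / 2.0)   # square root of the chi2(1) critical value
    shift = (2.0 * lam) ** 0.5               # mean of the underlying normal variate
    return (1.0 - _STD.cdf(crit - shift)) + _STD.cdf(-crit - shift)

power_curve = {lam: wald_power_scalar(lam, 0.05) for lam in (0.0, 1.0, 5.0, 20.0)}
```

At $\lambda = 0$ the function returns the size of the test, and it increases toward one as $\lambda$ grows.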

These features hold also for h > 1. In particular power is monotonically increasing
in the noncentrality parameter $\lambda$ defined in (7.46). Several general results follow.

First, power is increasing in the distance between the null and alternative
hypotheses, as then $\delta$ and hence $\lambda$ increase.
[Figure 7.1 plot: "Test Power as a function of the ncp"; curves for test size = 0.10, 0.05, and
0.01; vertical axis: test power; horizontal axis: noncentrality parameter lambda, 0 to 20.]

Figure 7.1: Power of Wald chi-square test with one degree of freedom for three different
test sizes as the noncentrality parameter ranges from 0 to 20.



Second, for given alternative $\delta$ power increases with efficiency of the estimator $\hat{\theta}$,
as then $C_0$ is smaller and hence $\lambda$ is larger.

Third, as the size of the test increases power increases and the probability of a type II
error decreases.

Fourth, if several different test statistics are all $\chi^2(h)$ under the null hypothesis
and noncentral $\chi^2(h)$ under the alternative, the preferred test statistic is that with the
highest noncentrality parameter $\lambda$ since then power is the highest. Furthermore, two
tests that have the same noncentrality parameter are asymptotically equivalent under
local alternatives.

Finally, in actual applications one can calculate the power as a function of $\delta$.
Specifically, for a specified alternative $\delta$, an estimated noncentrality parameter $\hat{\lambda}$ can be
computed using (7.46) using parameter estimate $\hat{\theta}$ with associated estimates $\hat{R}$ and $\hat{C}$.
Such power calculations are illustrated in Section 7.6.5.


7.6.4. Derivation of Asymptotic Power 


To obtain the distribution of W under $H_a$, begin with the Taylor series expansion result
(7.9). This simplifies to


$$\sqrt{N}h(\hat{\theta}) \overset{d}{\to} N[\delta, R_0 C_0 R_0'], \tag{7.48}$$


under $H_a$, since then $\sqrt{N}h(\theta) = \delta$. Thus a quadratic form centered at $\delta$ would be
chi-square distributed under $H_a$.

The Wald test statistic W defined in (7.6) instead forms a quadratic form centered
at 0 and is no longer chi-squared distributed under $H_a$. In general if $z \sim N[\mu, \Omega]$,
where rank($\Omega$) = h, then $z'\Omega^{-1}z \sim \chi^2(h; \lambda)$, where $\chi^2(h; \lambda)$ denotes the noncentral
chi-square distribution with noncentrality parameter $\lambda = \tfrac{1}{2}\mu'\Omega^{-1}\mu$. Applying this
result to (7.48) yields

$$N h(\hat{\theta})'(R_0 C_0 R_0')^{-1}h(\hat{\theta}) \overset{d}{\to} \chi^2(h; \lambda), \tag{7.49}$$

under $H_a$, where $\lambda$ is defined in (7.46).
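
The general result used in this derivation can be checked by simulation in the scalar case. The sketch below draws $z \sim N[\mu, \Omega]$ with h = 1 and compares the simulated mean of $z^2/\Omega$ with $1 + \mu^2/\Omega$, which equals $h + 2\lambda$ under the definition $\lambda = \tfrac{1}{2}\mu'\Omega^{-1}\mu$. The values of $\mu$ and $\Omega$, the seed, and the function name are illustrative assumptions.

```python
import random

def simulate_quadratic_form(mu, omega, reps, seed=0):
    """Draw z ~ N[mu, omega] in the scalar case (h = 1) and return the
    simulated values of the quadratic form z^2 / omega."""
    rng = random.Random(seed)
    sd = omega ** 0.5
    return [rng.gauss(mu, sd) ** 2 / omega for _ in range(reps)]

mu, omega = 1.5, 2.0
lam = 0.5 * mu ** 2 / omega          # noncentrality parameter as defined in the text
draws = simulate_quadratic_form(mu, omega, reps=200_000)
sim_mean = sum(draws) / len(draws)
theory_mean = 1 + 2.0 * lam          # h + 2 * lam with h = 1
```

The simulated mean exceeds one, the mean of a central chi-square(1) variate, by about $2\lambda$, reflecting the quadratic form being centered at 0 rather than at $\mu$.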
7.6.5. Calculation of Asymptotic Power


To shed light on how power changes with $\delta$, consider tests of coefficient significance
in the scalar case. Then the noncentrality parameter defined in (7.46) is


$$\lambda = \frac{\delta^2}{2c} \simeq \frac{(\delta/\sqrt{N})^2}{2(se[\hat{\theta}])^2}, \tag{7.50}$$

where the approximation arises because of estimation of c, the limit variance of
$\sqrt{N}(\hat{\theta} - \theta)$, by $N(se[\hat{\theta}])^2$, where $se[\hat{\theta}]$ is the standard error of $\hat{\theta}$.


Consider a Wald chi-square test of $H_0: \theta = 0$ against the alternative hypothesis that
$\theta$ is within a standard errors of zero, that is, against


$$H_a: \theta = a \times se[\hat{\theta}],$$


where $se[\hat{\theta}]$ is treated here as a constant. Then $\delta/\sqrt{N}$ in (7.45) equals $a \times se[\hat{\theta}]$, so
that (7.50) simplifies to $\lambda = a^2/2$. Thus the Wald test is asymptotically $\chi^2(1; \lambda)$ under
$H_a$ where $\lambda = a^2/2$.

From Figure 7.1 it is clear for the common case of significance level tests at 5% that
if a = 2 the power is well below 0.5, if a = 4 the power is around 0.5, and if a = 6 the
power is still below 0.9. A borderline test of statistical significance can therefore have
low power against alternatives that are many standard errors from zero. Intuitively, if
$\hat{\theta} = 2se[\hat{\theta}]$ then a test of $\theta = 0$ against $\theta = 4se[\hat{\theta}]$ has power of approximately 0.5,
because a 95% confidence interval for $\theta$ is approximately $(0, 4se[\hat{\theta}])$, implying that
values of $\theta = 0$ or $\theta = 4se[\hat{\theta}]$ are just as likely.

As a more concrete example, suppose $\theta$ measures the percentage increase in wage
resulting from a training program, and that a study finds $\hat{\theta} = 6$ with $se[\hat{\theta}] = 4$. Then
the Wald test at 5% significance level leads to nonrejection of $H_0$, since $W = (6/4)^2 =
2.25 < \chi^2_{.05}(1) = 3.84$. The conclusion of such a study will often state that the training
program is not statistically significant. One should not interpret this as meaning that
there is a high probability that the training program has no effect, however, as this test
has low power. For example, the preceding analysis indicates that a test of $H_0: \theta = 0$
against $H_a: \theta = 16$, a relatively large training effect, has power of only 0.5, since
$4 \times se[\hat{\theta}] = 16$. Reasons for low power include small sample size, large model error
variance, and small spread in the regressors.

In simple cases, solving the inverse problem of estimating the minimum sample size 
needed to achieve a given desired level of power is possible. This is especially popular 
in medical studies. 
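
One textbook version of this inverse problem, offered here only as a hedged sketch, is the two-sided z-test of $H_0: \theta = 0$ when the standard error is assumed to equal $\sigma/\sqrt{N}$ for a known $\sigma$. Ignoring the negligible chance of rejecting in the wrong tail, the required sample size is $N \geq ((z_{\alpha/2} + z_{\text{power}})\sigma/\theta)^2$. The function name, the default power of 0.80, and the example values are assumptions, not taken from the text.

```python
import math
from statistics import NormalDist

def min_sample_size(effect, sigma, alpha=0.05, power=0.80):
    """Smallest N at which a two-sided z-test of H0: theta = 0 has roughly
    the requested power against theta = effect, assuming se = sigma / sqrt(N)
    and ignoring rejections in the wrong tail."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    z_power = z.inv_cdf(power)
    return math.ceil(((z_alpha + z_power) * sigma / effect) ** 2)

n_needed = min_sample_size(effect=0.5, sigma=1.0)
```

Halving the effect size roughly quadruples the required sample, since N enters through the squared ratio $\sigma/\theta$.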

Andrews (1989) gives a more formal treatment of using the noncentrality parameter 
to determine regions of the parameter space against which a test in an empirical setting 
is likely to have low power. He provides many applied examples where it is easy to 
determine that tests have low power against meaningful alternatives. 


7.7. Monte Carlo Studies 


Our discussion of statistical inference has so far relied on asymptotic results. For small 
samples analytical results are rarely available, aside from tests of linear restrictions in 
the linear regression model under normality. Small-sample results can nonetheless be
obtained by performing a Monte Carlo study. 


7.7.1. Overview 


An example of a Monte Carlo study of the small-sample properties of a test statistic is
the following. Set the sample size N to 40, say, and randomly generate 10,000 samples
of size 40 under the $H_0$ model. For each replication (sample) form the test statistic of
interest and test $H_0$, rejecting $H_0$ if the test statistic falls in the rejection region, usually
determined by asymptotic results.

The true size or actual size of the test statistic is simply the fraction of replications 
for which the test statistic falls in the rejection region. Ideally, this is close to the 
nominal size, which is the chosen significance level of the test. For example, if testing 
at 5% the nominal test size is 0.05 and the true size is hopefully close to 0.05. 

Determining test power in small samples requires additional simulation, with
samples generated under one or more particular specifications of the possible models that
lie in the composite alternative hypothesis $H_a$. The power is calculated as the fraction
of replications for which the null hypothesis is rejected, using either the same test as used
in determining the true size, or a size-corrected version of the test that uses a rejection
region such that the nominal size equals the true size.

Monte Carlo studies are simple to implement, but there are many subtleties involved 
in designing a good Monte Carlo study. For an excellent discussion see Davidson and 
MacKinnon (1993). 


7.7.2. Monte Carlo Details 


As an example of a Monte Carlo study we consider statistical inference on the slope 
coefficient in a probit model. The following analysis does not rely on knowledge of 
the probit model. 

The data-generating process is a probit model, with binary dependent variable y equal
to one with probability


$$\Pr[y = 1|x] = \Phi(\beta_1 + \beta_2 x),$$


where $\Phi(\cdot)$ is the standard normal cdf, $x \sim N[0, 1]$, and $(\beta_1, \beta_2) = (0, 1)$.

The data (y, x) are easily generated for this dgp. The regressor x is first obtained as
a random draw from the standard normal distribution. Then, from Section 14.4.2 the
dependent variable y is set equal to 1 if x + u > 0 and is set to 0 otherwise, where u
is a random draw from the standard normal. For this dgp y = 1 roughly half the time
and y = 0 the other half.

In each simulation N new observations of both x and y are drawn, and the MLE 
from probit regression of y on x is obtained. An alternative is to use the same N draws 
of the regressor x in each simulation and only redraw y. The former setup corresponds 
to simple random sampling and the latter corresponds to analysis conditional on x or 
“fixed in repeated trials”; see Section 4.4.7. 

Monte Carlo studies often consider a range of sample sizes. Here we simply
set N = 40. Programs can be checked by also setting a very large value of N,
say N = 10,000, as then Monte Carlo results should be very close to asymptotic
results.

Numerous simulations are needed to determine actual test size, because this
depends on behavior in the tails of the distribution rather than the center. If S simulations
are run for a test of true size $\alpha$, then the proportion of times the null hypothesis is
rejected is an outcome from S binomial trials with mean $\alpha$ and variance
$\alpha(1 - \alpha)/S$. So 95% of Monte Carlos will estimate the test size to be in the
interval $\alpha \pm 1.96\sqrt{\alpha(1 - \alpha)/S}$. A mere 100 simulations is not enough since, for example,
this interval is (0.007, 0.093) when $\alpha = 0.05$. For 10,000 simulations the 95%
interval is much more precise, equalling (0.008, 0.012), (0.046, 0.054), (0.094, 0.106), and
(0.192, 0.208) for $\alpha$ equal to, respectively, 0.01, 0.05, 0.10, and 0.20. Here S = 10,000
simulations are used.
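
These intervals follow directly from the binomial variance and can be reproduced in a few lines. The function name below is hypothetical.

```python
def size_interval(alpha, n_sims, z=1.96):
    """Range within which about 95% of Monte Carlo size estimates fall when
    the true size is alpha and n_sims replications are used."""
    half_width = z * (alpha * (1.0 - alpha) / n_sims) ** 0.5
    return alpha - half_width, alpha + half_width

for alpha in (0.01, 0.05, 0.10, 0.20):
    lo, hi = size_interval(alpha, 10_000)
    print(f"alpha={alpha:.2f}: ({lo:.3f}, {hi:.3f})")
```

Because the half-width shrinks with the square root of S, moving from 100 to 10,000 simulations narrows the interval by a factor of ten.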

A problem that can arise in Monte Carlo simulations is that for some simulation 
samples the model may not be estimable. For example, consider linear regression on 
just an intercept and an indicator variable. If the indicator variable happens to always 
take the same value, say 0, in a simulation sample then its coefficient cannot be sepa- 
rately identified from that for the intercept. A similar problem arises in the probit and 
other binary outcome models, if all ys are 0 or all ys are 1 in a simulation sample. The 
standard procedure, which can be criticized, is to drop such simulation samples, and to 
write computer code that permits the simulation loop to continue when such a problem 
arises. In this example the problem did not arise with N = 40, but it did for N = 30. 


7.7.3. Small-Sample Bias 


Before moving to testing we look at the small-sample properties of the MLE $\hat{\beta}_2$ and
its estimated standard error $se[\hat{\beta}_2]$.

Across the 10,000 simulations $\hat{\beta}_2$ had mean 1.201 and standard deviation 0.452,
whereas $se[\hat{\beta}_2]$ had mean 0.359. The MLE is therefore biased upward in small
samples, as the average of $\hat{\beta}_2$ is considerably greater than $\beta_2 = 1$. The standard errors are
biased downward in small samples since the average of $se[\hat{\beta}_2]$ is considerably smaller
than the standard deviation of $\hat{\beta}_2$.
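
A simulation of this kind can be sketched as follows. This is not the code behind the reported numbers: it uses 2,000 rather than 10,000 replications to keep the run short, estimates the probit by Fisher scoring with standard errors from the expected information, and drops any sample for which estimation fails, as discussed in Section 7.7.2. Function names, the seed, and the failure thresholds are assumptions, so results will differ somewhat from those in the text.

```python
import random
from statistics import NormalDist

STD = NormalDist()

def probit_fit(y, x, max_iter=50, tol=1e-8):
    """Probit MLE of (b1, b2) by Fisher scoring. Returns (b1, b2, se_b2), or
    None if the iterations fail, for example under perfect separation."""
    b1 = b2 = 0.0
    for _ in range(max_iter):
        g1 = g2 = i11 = i12 = i22 = 0.0
        for yi, xi in zip(y, x):
            idx = b1 + b2 * xi
            p = min(max(STD.cdf(idx), 1e-10), 1.0 - 1e-10)
            dens = STD.pdf(idx)
            w = dens * dens / (p * (1.0 - p))       # expected-information weight
            r = (yi - p) * dens / (p * (1.0 - p))   # score contribution
            g1 += r
            g2 += r * xi
            i11 += w
            i12 += w * xi
            i22 += w * xi * xi
        det = i11 * i22 - i12 * i12
        if det <= 1e-12:
            return None
        d1 = (i22 * g1 - i12 * g2) / det
        d2 = (i11 * g2 - i12 * g1) / det
        b1 += d1
        b2 += d2
        if abs(b2) > 20.0:
            return None
        if abs(d1) + abs(d2) < tol:
            # Information from the last iterate, which differs negligibly
            return b1, b2, (i11 / det) ** 0.5
    return None

def simulate(n_obs=40, n_sims=2000, beta2=1.0, seed=42):
    rng = random.Random(seed)
    slopes, ses, dropped = [], [], 0
    for _ in range(n_sims):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        y = [1.0 if beta2 * xi + rng.gauss(0.0, 1.0) > 0 else 0.0 for xi in x]
        fit = probit_fit(y, x) if 0 < sum(y) < n_obs else None
        if fit is None:
            dropped += 1        # non-estimable sample: skip and carry on
            continue
        slopes.append(fit[1])
        ses.append(fit[2])
    return slopes, ses, dropped

slopes, ses, dropped = simulate()
mean_slope = sum(slopes) / len(slopes)
mean_se = sum(ses) / len(ses)
```

Comparing `mean_slope` with 1 and `mean_se` with the standard deviation of `slopes` gives the bias comparisons described above.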


7.7.4. Test Size


We consider a two-sided test of $H_0: \beta_2 = 1$ against $H_a: \beta_2 \neq 1$, using the Wald test


$$z = \frac{\hat{\beta}_2 - 1}{se[\hat{\beta}_2]},$$

where $se[\hat{\beta}_2]$ is the standard error of the MLE estimated using the variance matrix
given in Section 14.3.2, which is minus the inverse of the expected Hessian. Given the
dgp, asymptotically z is standard normal distributed and $z^2$ is chi-squared distributed.
The goal is to find how well this approximates the small-sample distribution.

Figure 7.2 gives the density for the S = 10,000 computed values of z, where the
density is plotted using the kernel density estimate of Chapter 9 rather than a histogram.
This is superimposed on the standard normal density. Clearly the asymptotic result is
not exact, especially in the upper tail where the difference is clearly large enough to
lead to size distortions when testing at, say, 5%. Also, across the simulations z has
mean 0.114 ≠ 0 and standard deviation 0.956 ≠ 1.

Table 7.2. Wald Test Size and Power for Probit Regression Example^a

Nominal Size (α)    Actual Size    Actual Power    Asymptotic Power
0.01                0.005          0.007           0.272
0.05                0.029          0.226           0.504
0.10                0.081          0.608           0.628
0.20                0.192          0.858           0.755

a The dgp for y is the probit with Pr[y = 1] = Φ(0 + β2 x) and sample size N = 40. The test is a
two-sided Wald test of whether or not the slope coefficient equals 1. Actual size is calculated from S =
10,000 simulations with β2 = 1 and power is calculated from 10,000 simulations with β2 = 2.


The first two columns of Table 7.2 give the nominal size and the actual size of 
the Wald test for nominal sizes a = 0.01, 0.05, 0.10, and 0.20. The actual size is the 
proportion of the 10,000 simulations in which |z| > Z/2, or equivalently that z? > 
xa). Clearly the actual size of the test is much less than the nominal size for œ < 
0.10. An ad hoc small-sample correction is to instead assume that z is ¢ distributed 
with 38 degrees of freedom, and reject if |z| > t./2(38). However, this leads to even 
smaller actual size, since fy/2(38) > Za/2. 

The Monte Carlo simulations can also be used to obtain size-corrected critical values.
Thus the lower and upper 2.5 percentiles of the 10,000 simulated values of $z$ are
$-1.905$ and $2.003$. It follows that an asymmetric rejection region with actual size 0.05
is $z < -1.905$ or $z > 2.003$, a larger rejection region than $|z| > 1.960$.
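
The size calculation and the size-corrected critical values can be sketched in code. The
following is a minimal illustration rather than the program used for Table 7.2. It assumes
that the regressor is drawn as x ~ N[0, 1] afresh in each simulation, uses a hand-written
Fisher scoring routine (the hypothetical helper `probit_fit`), runs only S = 2,000
simulations to keep the run short, and drops the rare samples in which the iterations fail.
The numbers it produces will therefore differ somewhat from those reported above.

```python
import math
import random

def norm_cdf(v):
    """Standard normal cdf."""
    return 0.5 * math.erfc(-v / math.sqrt(2.0))

def norm_pdf(v):
    """Standard normal density."""
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)

def probit_fit(y, x, max_iter=50, tol=1e-8):
    """Probit MLE of Pr[y=1] = Phi(b1 + b2*x) by Fisher scoring.

    Returns (b1, b2, se_b2), where se_b2 comes from the inverse of the
    expected information at the last iterate, or None on non-convergence.
    """
    b1 = b2 = 0.0
    for _ in range(max_iter):
        g1 = g2 = a11 = a12 = a22 = 0.0
        for yi, xi in zip(y, x):
            idx = b1 + b2 * xi
            p = min(max(norm_cdf(idx), 1e-12), 1.0 - 1e-12)
            d = norm_pdf(idx)
            r = d * (yi - p) / (p * (1.0 - p))  # score contribution
            w = d * d / (p * (1.0 - p))         # expected-information weight
            g1 += r
            g2 += r * xi
            a11 += w
            a12 += w * xi
            a22 += w * xi * xi
        det = a11 * a22 - a12 * a12
        if det < 1e-10:
            return None
        step1 = (a22 * g1 - a12 * g2) / det
        step2 = (a11 * g2 - a12 * g1) / det
        b1 += step1
        b2 += step2
        if max(abs(b1), abs(b2)) > 20.0:
            return None                          # likely (quasi-)separation
        if abs(step1) + abs(step2) < tol:
            return b1, b2, math.sqrt(a11 / det)
    return None

def simulate_z(beta2_true, n=40, sims=2000, seed=12345):
    """Simulated Wald statistics z = (b2_hat - 1)/se[b2_hat] for H0: beta2 = 1."""
    rng = random.Random(seed)
    zs = []
    for _ in range(sims):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]          # assumed regressor dgp
        y = [1.0 if rng.random() < norm_cdf(beta2_true * xi) else 0.0 for xi in x]
        fit = probit_fit(y, x)
        if fit is not None:                                   # failed samples are dropped
            zs.append((fit[1] - 1.0) / fit[2])
    return zs

def actual_size(zs, crit):
    """Proportion of simulations with |z| above the critical value."""
    return sum(abs(z) > crit for z in zs) / len(zs)

def percentile(values, q):
    """Empirical quantile taken as the nearest order statistic."""
    s = sorted(values)
    return s[int(round(q * (len(s) - 1)))]

zs = simulate_z(1.0)                                          # dgp satisfies H0
print("actual size at nominal 0.05:", actual_size(zs, 1.96))
print("size-corrected critical values:", percentile(zs, 0.025), percentile(zs, 0.975))
```

Calling `simulate_z(2.0)` and applying `actual_size` to the result gives the corresponding
estimate of actual power against $\beta_2 = 2$.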


7.7.5. Test Power 


We consider power of the Wald test under $H_a: \beta_2 = 2$. We would expect the power to
be reasonable because this value of $\beta_2$ lies two to three standard errors away from the
null hypothesis value of $\beta_2 = 1$, given that $\mathrm{se}[\hat{\beta}_2]$ has average value 0.359. The actual
and nominal (asymptotic) power of the Wald test are given in the last two columns of
Table 7.2.

[Figure 7.2 appears here: "Monte Carlo Simulations of Wald Test". Horizontal axis: Wald test
statistic. Vertical axis: density. Plotted curves: Monte Carlo and standard normal.]

Figure 7.2: Density of Wald test statistic that slope coefficient equals one computed by
Monte Carlo simulation with standard normal density also plotted for comparison. Data are
generated from a probit regression model.

The actual power is obtained in the same way as actual size, being the proportion
of the 10,000 simulations in which $|z| > z_{\alpha/2}$. The only change is that, in generating y
in the simulation, $\beta_2 = 2$ rather than 1. The actual power is very low for $\alpha = 0.01$ and
0.05, cases where the actual size is much less than the nominal size.

The nominal power of the Wald test is determined using the asymptotic noncentral
$\chi^2(1, \lambda)$ distribution under $H_a$, where from (7.50)
$\lambda = \frac{1}{2}(\delta/\sqrt{N})^2/\mathrm{se}[\hat{\beta}_2]^2 \simeq \frac{1}{2} \times 1^2/0.359^2 \simeq 3.88$,
since the local alternative is that $H_a: \beta_2 - 1 = \delta/\sqrt{N}$, so
$\delta/\sqrt{N} = 1$ for $\beta_2 = 2$. The asymptotic result is not exact, but it does provide a useful
estimate of the power for $\alpha = 0.10$ and 0.20, cases where the true size closely matches
the nominal size.
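
The asymptotic power calculation can be checked numerically. The sketch below is
illustrative only. It assumes that $\chi^2(1, \lambda)$ is read as the distribution of
$(Z + \sqrt{\lambda})^2$ with $Z$ standard normal, so that with one degree of freedom the power
reduces to two normal tail probabilities; the function name `asymptotic_power` is
hypothetical. Under this reading the computed values agree with the Asymptotic Power
column of Table 7.2 to within about 0.001.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def asymptotic_power(alpha, noncentrality):
    """Pr[(Z + sqrt(nc))**2 > z_{alpha/2}**2] for Z ~ N[0, 1].

    With one degree of freedom the chi-square critical value equals
    z_{alpha/2}**2, so the rejection event is |Z + sqrt(nc)| > z_{alpha/2}.
    """
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = math.sqrt(noncentrality)
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)

# lambda = (1/2) * (delta/sqrt(N))**2 / se**2 with delta/sqrt(N) = 1 and se = 0.359
lam = 0.5 * 1.0 ** 2 / 0.359 ** 2

for alpha in (0.01, 0.05, 0.10, 0.20):
    print(alpha, round(asymptotic_power(alpha, lam), 3))
```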


7.7.6. Monte Carlo in Practice 


The preceding discussion has emphasized use of the Monte Carlo analysis to calculate 
test power and size. A Monte Carlo analysis can also be very useful for determining 
small-sample bias in an estimator and, by setting N large, for determining that an 
estimator is actually consistent. Such Monte Carlo routines are very simple to run 
using current computer packages. 

A Monte Carlo analysis can be applied to real data if the conditional distribution
of y given x is fully parametrized. For example, consider a probit model estimated
with real data. In each simulation the regressors are set at their sample values, if the
sampling framework is one of fixed regressors in repeated samples, while a new set of
values for the binary dependent variable y needs to be generated. This will depend on
what values of the parameters $\boldsymbol{\beta}$ are used. Let $\hat{\beta}_1, \ldots, \hat{\beta}_K$ denote the probit estimates
from the original sample and consider a Wald test of $H_0: \beta_k = 0$. To calculate test size,
generate S simulation samples by setting $\beta_j = \hat{\beta}_j$ for $j \neq k$ and setting $\beta_k = 0$, and
then calculate the proportion of simulations in which $H_0: \beta_k = 0$ is rejected. To estimate
the power of the Wald test against a specific alternative $H_a: \beta_k = 1$, say, generate
y with $\beta_j = \hat{\beta}_j$ for $j \neq k$ and $\beta_k = 1$, and calculate the proportion of
simulations in which $H_0: \beta_k = 0$ is rejected.

In practice much microeconometric analysis is based on estimators that are not 
based on fully parametric models. Then additional distributional assumptions are 
needed to perform a Monte Carlo analysis. 

Alternatively, power can be calculated using asymptotic methods rather than finite- 
sample methods. Additionally the bootstrap, presented next, can be used to obtain size 
using a more refined asymptotic theory. 


7.8. Bootstrap Example 


The bootstrap is a variant of Monte Carlo simulation that has the attraction of being
implementable with fewer parametric assumptions and with little additional program
code beyond that required to estimate the model in the first place. Essential ingredients
for the bootstrap to be valid are that the estimator actually has a limit distribution and
that the bootstrap resamples quantities that are iid.

The bootstrap has two general uses. First, it can be used as an alternative way to 
compute statistics without asymptotic refinement. This is particularly useful for com- 
puting standard errors when analytical formulas are complex. Second, it can be used 
to implement a refinement of the usual asymptotic theory that may provide a better 
finite-sample approximation to the distribution of test statistics. 

We illustrate the bootstrap to implement a Wald test, ahead of a complete treatment 
in Chapter 11. 


7.8.1. Inference Using Standard Asymptotics 


Consider again a probit example with binary outcome y equal to one with probability
$p = \Phi(\gamma + \beta x)$, where $\Phi(\cdot)$ is the standard normal cdf. Interest lies in testing
$H_0: \beta = 1$ against $H_a: \beta \neq 1$ at significance level 0.05. The analysis here does not require
knowledge of the probit model.

One sample of size N = 30 is generated. Probit ML estimation yields $\hat{\beta} = 0.817$
and $s_{\hat{\beta}} = 0.294$, where the standard error is based on $-\hat{\mathbf{A}}^{-1}$, so the test statistic
$z = (0.817 - 1)/0.294 = -0.623$.

Using standard asymptotic theory we obtain 5% critical values of $-1.96$ and $1.96$,
since $z_{.025} = 1.96$, and $H_0$ is not rejected.


7.8.2. Bootstrap without Asymptotic Refinement 


The departure point of the bootstrap method is to resample from an approximation to 
the population; see Section 11.2.1. The paired bootstrap does so by resampling from 
the original sample. 

Thus form B pseudo-samples of size N by drawing with replacement from the original
data $\{(y_i, x_i), i = 1, \ldots, N\}$. For example, the first pseudo-sample of size 30 may
have $(y_1, x_1)$ once, $(y_2, x_2)$ not at all, $(y_3, x_3)$ twice, and so on. This yields B estimates
$\hat{\beta}^*_1, \ldots, \hat{\beta}^*_B$ of the parameter of interest $\beta$, that can be used to estimate features of the
distribution of the original estimate $\hat{\beta}$.

For example, suppose the computer program used to estimate a probit model reports
$\hat{\beta}$ but not the standard error $s_{\hat{\beta}}$. The bootstrap solves this problem since we can use
the estimated standard deviation $s_{\hat{\beta},\mathrm{boot}}$ of $\hat{\beta}^*_1, \ldots, \hat{\beta}^*_B$ from the B bootstrap
pseudo-samples. Given this standard error estimate it is possible to perform a Wald hypothesis
test on $\beta$.

For the probit Wald test example, the resulting bootstrap estimate of the standard
error of $\hat{\beta}$ is 0.376, leading to $z = (0.817 - 1)/0.376 = -0.487$. Since $-0.487$ lies in
$(-1.96, 1.96)$ we do not reject $H_0$ at 5%.

This use of the bootstrap to test hypotheses does not lead to size improvements in
small samples. However, it can lead to great time savings in many applications if it is
difficult to otherwise obtain the standard errors for an estimator.


7.8.3. Bootstrap with Asymptotic Refinement


Some bootstraps can lead to a better asymptotic approximation to the distribution of 
z. This is likely to lead to finite-sample critical values that are better in the sense that 
the actual size is likely to be closer to the nominal size of 0.05. Details are provided in 
Chapter 11. Here we illustrate the method. 

Again form B pseudo-samples of size N by drawing with replacement from the
original data. Estimate the probit model in each pseudo-sample and for the bth
pseudo-sample compute $z^*_b = (\hat{\beta}^*_b - \hat{\beta})/s_{\hat{\beta}^*_b}$, where $\hat{\beta}$ is the original estimate. The
bootstrap distribution for the original test statistic $z$ is then the empirical distribution
of $z^*_1, \ldots, z^*_B$ rather than the standard normal. The lower and upper 2.5 percentiles of
this empirical distribution give the bootstrap critical values.

For the example here with B = 1,000 the lower and upper 2.5 percentiles of the
empirical bootstrap distribution of $z$ were found to be $-2.62$ and $1.83$. The bootstrap
critical values for testing at 5% are then $-2.62$ and $1.83$, rather than the usual $\pm 1.96$.
Since the initial sample test statistic $z = -0.623$ lies in $(-2.62, 1.83)$ we do not reject
$H_0: \beta = 1$. A bootstrap p-value can also be computed.

Unlike the bootstrap in the previous section, an asymptotic improvement occurs
here because the studentized test statistic $z$ is asymptotically pivotal (see Section
11.2.3) whereas the estimator $\hat{\beta}$ is not.
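
The two uses of the paired bootstrap in Sections 7.8.2 and 7.8.3 can be sketched together.
The code below is a minimal illustration under stated assumptions, not the program behind
the numbers quoted above: the sample is a hypothetical one generated with $\gamma = 0$,
$\beta = 1$, and x ~ N[0, 1]; `probit_fit` is a hand-written Fisher scoring routine; B = 999
replications are used; and pseudo-samples in which the probit iterations fail (for example
because of separation) are skipped.

```python
import math
import random
import statistics

def norm_cdf(v):
    return 0.5 * math.erfc(-v / math.sqrt(2.0))

def norm_pdf(v):
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)

def probit_fit(y, x, max_iter=50, tol=1e-8):
    """Fisher scoring for Pr[y=1] = Phi(g + b*x); returns (g, b, se_b) or None."""
    g = b = 0.0
    for _ in range(max_iter):
        s1 = s2 = a11 = a12 = a22 = 0.0
        for yi, xi in zip(y, x):
            idx = g + b * xi
            p = min(max(norm_cdf(idx), 1e-12), 1.0 - 1e-12)
            d = norm_pdf(idx)
            r = d * (yi - p) / (p * (1.0 - p))   # score contribution
            w = d * d / (p * (1.0 - p))          # expected-information weight
            s1 += r
            s2 += r * xi
            a11 += w
            a12 += w * xi
            a22 += w * xi * xi
        det = a11 * a22 - a12 * a12
        if det < 1e-10:
            return None
        dg = (a22 * s1 - a12 * s2) / det
        db = (a11 * s2 - a12 * s1) / det
        g += dg
        b += db
        if max(abs(g), abs(b)) > 20.0:
            return None                           # likely separation
        if abs(dg) + abs(db) < tol:
            return g, b, math.sqrt(a11 / det)
    return None

def quantile(values, q):
    """Empirical quantile taken as the nearest order statistic."""
    s = sorted(values)
    return s[int(round(q * (len(s) - 1)))]

def paired_bootstrap(y, x, reps=999, seed=2024):
    """Bootstrap standard error of b_hat and percentile-t critical values."""
    rng = random.Random(seed)
    n = len(y)
    _, b_hat, se_hat = probit_fit(y, x)
    betas, z_stars = [], []
    for _ in range(reps):
        draw = [rng.randrange(n) for _ in range(n)]    # resample (y_i, x_i) pairs
        fit = probit_fit([y[i] for i in draw], [x[i] for i in draw])
        if fit is None:
            continue
        betas.append(fit[1])
        z_stars.append((fit[1] - b_hat) / fit[2])      # centered at the original b_hat
    return {"b_hat": b_hat, "se_hat": se_hat, "n_ok": len(betas),
            "se_boot": statistics.stdev(betas),
            "z_lo": quantile(z_stars, 0.025), "z_hi": quantile(z_stars, 0.975)}

# Hypothetical sample of size N = 30 (redrawn in the unlikely event of non-convergence).
rng = random.Random(7)
fit = None
while fit is None:
    x = [rng.gauss(0.0, 1.0) for _ in range(30)]
    y = [1.0 if rng.random() < norm_cdf(xi) else 0.0 for xi in x]
    fit = probit_fit(y, x)

res = paired_bootstrap(y, x)
z = (res["b_hat"] - 1.0) / res["se_hat"]               # Wald statistic for H0: beta = 1
print("z with asymptotic se:", z)
print("z with bootstrap se :", (res["b_hat"] - 1.0) / res["se_boot"])
print("percentile-t critical values:", res["z_lo"], res["z_hi"])
print("reject H0 at 5% (percentile-t):", z < res["z_lo"] or z > res["z_hi"])
```

The first use replaces only the standard error and still compares $z$ with $\pm 1.96$. The second
keeps the original $z$ and replaces the critical values by the percentiles of the $z^*_b$.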


7.9. Practical Considerations 


Microeconometrics research places emphasis on statistical inference based on min- 
imal distributional assumptions, using robust estimates of the variance matrix of an 
estimator. There is no sense in robust inference, however, if failure of distributional 
assumptions leads to the more serious complication of estimator inconsistency as can 
happen for some though not all ML estimators. 

Many packages provide a "robust" standard errors option in estimator commands.
In microeconometrics packages robust often means heteroskedasticity consistent and does
not guard against other complications such as clustering (see Section 24.5) that can
also lead to invalid statistical inference.

Robust inference is usually implemented using a Wald test. The Wald test has the
weakness of lack of invariance to reparametrization of nonlinear hypotheses, though this may
be diminished by performing an appropriate bootstrap. Standard auxiliary regressions
for the LM test and implementations of LM tests on computer packages are usually
not robustified, though in some cases relatively simple robustification of the LM test
is possible (see Section 8.4).

The power of tests can be weak. Ideally, power against some meaningful alternative 
would be reported. Failing this, as Section 7.6 indicates, one should be careful about 
overstating the conclusions from a hypothesis test unless parameters are very precisely 
estimated. 

The finite sample size of tests derived from asymptotic theory is also an issue. The
bootstrap method, detailed in Chapter 11, has the potential to yield hypothesis tests
and confidence intervals with much better finite-sample properties.

Statistical inference can be quite fragile, so these issues are of importance to the
practitioner. Consider a two-tailed Wald test of statistical significance when $\hat{\theta} = 1.96$,
and assume the test statistic is indeed standard normal distributed. If $s_{\hat{\theta}} = 1.0$ then
$t = 1.96$ and the p-value is 0.050. However, the true p-value is a much higher 0.117
if the standard error was underestimated by 20% (so correct $t = 1.57$), and a much
lower 0.019 if the standard error was overestimated by 20% (so $t = 2.35$).
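
The sensitivity of the p-value to the standard error is easy to reproduce. The short sketch
below assumes that "underestimated by 20%" means the reported standard error is 0.8 times
the true one and that "overestimated by 20%" means it is 1.2 times the true one, which is
the reading that matches the t values 1.57 and 2.35; the function name `two_sided_p` is
hypothetical.

```python
from statistics import NormalDist

def two_sided_p(t):
    """Two-sided p-value for a standard normal test statistic."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))

estimate, reported_se = 1.96, 1.0
cases = (
    ("reported se correct", reported_se),
    ("se underestimated by 20%", reported_se / 0.8),   # true se is larger
    ("se overestimated by 20%", reported_se / 1.2),    # true se is smaller
)
for label, true_se in cases:
    t = estimate / true_se
    print(f"{label}: t = {t:.2f}, p = {two_sided_p(t):.3f}")
```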


7.10. Bibliographic Notes 


The econometrics texts by Gouriéroux and Monfort (1989) and Davidson and MacKinnon 
(1993) give quite lengthy treatment of hypothesis testing. The presentation here considers only 
equality restrictions. For tests of inequality restrictions see Gouriéroux, Holly, and Monfort 
(1982) for the linear case and Wolak (1991) for the nonlinear case. For hypothesis testing when 
the parameters are at the boundary of the parameter space under the null hypothesis the tests 
can break down; see Andrews (2001). 


7.3 A useful graphical treatment of the three classical test procedures is given by Buse (1982). 

7.5 Newey and West (1987a) present extension of the classical tests to GMM estimation. 

7.6 Davidson and MacKinnon (1993) give considerable discussion of power and explain the 
distinction between explicit and implicit null and alternative hypotheses. 

7.7 For Monte Carlo studies see Davidson and MacKinnon (1993) and Hendry (1984). 

7.8 The bootstrap method due to Efron (1979) is detailed in Chapter 11. 


Exercises 


7-1 Suppose a sample yields estimates $\hat{\theta}_1 = 5$, $\hat{\theta}_2 = 3$ with asymptotic variance
estimates 4 and 2 and the correlation coefficient between $\hat{\theta}_1$ and $\hat{\theta}_2$ equals 0.5.
Assume asymptotic normality of the parameter estimates.


(a) Test $H_0: \theta_1 e^{\theta_2} = 100$ against $H_a: \theta_1 e^{\theta_2} \neq 100$ at level 0.05.
(b) Obtain a 95% confidence interval for $\gamma = \theta_1 e^{\theta_2}$.


7-2 Consider NLS regression for the model $y = \exp(\alpha + \beta x) + \varepsilon$, where $\alpha$, $\beta$, and
x are scalars and $\varepsilon \sim \mathcal{N}[0, 1]$. Note that for simplicity $\sigma^2 = 1$ and need not be
estimated. We want to test $H_0: \beta = 0$ against $H_a: \beta \neq 0$.


(a) Give the first-order conditions for the unrestricted MLE of $\alpha$ and $\beta$.

(b) Give the asymptotic variance matrix for the unrestricted MLE of $\alpha$ and $\beta$.

(c) Give the explicit solution for the restricted MLE of $\alpha$ and $\beta$.

(d) Give the auxiliary regression to compute the OPG form of the LM test.

(e) Give the complete expression for the original form of the LM test. Note that
it involves derivatives of the unrestricted log-likelihood evaluated at the
restricted MLE of $\alpha$ and $\beta$. [This is more difficult than parts (a)-(d).]


7-3 Suppose we wish to choose between two nested parametric models. The relationship
between the densities of the two models is that $g(y|x, \beta, \alpha = 0) = f(y|x, \beta)$,
where for simplicity both $\beta$ and $\alpha$ are scalars. If g is the correct density then the
MLE of $\beta$ based on density f is inconsistent. A test of model f against model
g is a test of $H_0: \alpha = 0$ against $H_a: \alpha \neq 0$. Suppose ML estimation yields the
following results: (1) model f: $\hat{\beta} = 5.0$, $\mathrm{se}[\hat{\beta}] = 0.5$, and $\ln L = -106$; (2) model
g: $\hat{\beta} = 3.0$, $\mathrm{se}[\hat{\beta}] = 1.0$, $\hat{\alpha} = 2.5$, $\mathrm{se}[\hat{\alpha}] = 1.0$, and $\ln L = -103$. Not all of the
following tests are possible given the preceding information. If there is enough
information, perform the tests and state your conclusions. If there is not enough
information, then state this.


(a) Perform a Wald test of Hp at level 0.05. 

(b) Perform a Lagrange multiplier test of Ho at level 0.05. 
(c) Perform a likelihood ratio test of Hp at level 0.05. 

(d) Perform a Hausman test of Ho at level 0.05. 


7-4 Consider test of $H_0: \mu = 0$ against $H_a: \mu \neq 0$ at nominal size 0.05 when the
dgp is $y \sim \mathcal{N}[\mu, 100]$, so the standard deviation is 10, and the sample size is
N = 10. The test statistic is the usual t-test statistic $t = \bar{y}/\sqrt{s^2/10}$, where
$s^2 = (1/9)\sum_i (y_i - \bar{y})^2$. Perform 1,000 simulations to answer the following.


(a) Obtain the actual size of the t-test if the correct finite-sample critical values
$\pm t_{.025}(8) = \pm 2.306$ are used. Is there size distortion?

(b) Obtain the actual size of the t-test if the asymptotic approximation critical
values $\pm z_{.025} = \pm 1.960$ are used. Is there size distortion?

(c) Obtain the power of the t-test against the alternative $H_a: \mu = 1$, when the
critical values $\pm t_{.025}(8) = \pm 2.306$ are used. Is the test powerful against this
particular alternative?


7-5 Use the health expenditure data of Section 16.6. The model is a probit regression
of DMED, an indicator variable for positive health expenditures, against the 17
regressors listed in the second paragraph of Section 16.6. You should obtain the
estimates given in the first column of Table 16.1. Consider joint test of the statistical
significance of the self-rated health indicators HLTHG, HLTHF, and HLTHP at
level 0.05.


(a) Perform a Wald test.

(b) Perform a likelihood ratio test.

(c) Perform an auxiliary regression to implement an LM test. [This will require
some additional coding.]


CHAPTER 8


Specification Tests and Model Selection


8.1. Introduction 


Two important practical aspects of microeconometric modeling are determining 
whether a model is correctly specified and selecting from alternative models. For these 
purposes it is often possible to use the hypothesis testing methods presented in the pre- 
vious chapter, especially when models are nested. In this chapter we present several 
other methods. 

First, m-tests such as conditional moment tests are tests of whether moment con- 
ditions imposed by a model are satisfied. The approach is similar in spirit to GMM, 
except that the moment conditions are not imposed in estimation and are instead used 
for testing. Such tests are conceptually very different from the hypothesis tests of 
Chapter 7, as there is no explicit statement of an alternative hypothesis model. 

Second, Hausman tests are tests of the difference between two estimators that are 
both consistent if the model is correctly specified but diverge if the model is incorrectly 
specified. 

Third, tests of nonnested models require special methods because the usual hypoth- 
esis testing approach can only be applied when one model is nested within another. 

Finally, it can be useful to compute and report statistics of model adequacy that are
not test statistics. For example, an analogue of $R^2$ may be used to measure the
goodness of fit of a nonlinear model.

Ideally, these methods are used in a cycle of model specification, estimating, testing, 
and evaluation. This cycle can move from a general model toward a specific model, or 
from a specific model to a more general one that is felt to capture the most important 
features of the data. 

Section 8.2 presents m-tests, including conditional moment tests, the information
matrix test, and chi-square goodness of fit tests. The Hausman test is presented in
Section 8.3. Tests for several common misspecifications are discussed in Section 8.4.
Discrimination between nonnested models is the focus of Section 8.5. Commonly used
convenient implementations of the tests of Sections 8.2-8.5 can rely on strong
distributional assumptions and/or perform poorly in finite samples. These concerns have
discouraged use of some of these tests, but such concerns are outdated because in many
cases the bootstrap methods presented in Chapter 11 can correct for these weaknesses.
Section 8.6 considers the consequences of testing a model on subsequent inference. Model
diagnostics are presented in the stand-alone Section 8.7.


8.2. m-Tests 


m-Tests, such as conditional moment tests, are a general specification testing proce- 
dure that encompasses many common specification tests. The tests are easily imple- 
mented using auxiliary regressions when estimation is by ML, a situation where tests 
of model assumptions are especially desirable. Implementation is usually more diffi- 
cult when estimators are instead based on minimal distributional assumptions. 

We first introduce the test statistic and computational methods, followed by leading 
examples and an illustration of the tests. 


8.2.1. m-Test Statistic 
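Suppose a model implies the population moment condition

$$H_0: \mathrm{E}[\mathbf{m}_i(\mathbf{w}_i, \boldsymbol{\theta})] = \mathbf{0}, \tag{8.1}$$

where $\mathbf{w}$ is a vector of observables, usually the dependent variable $y$ and regressors
$\mathbf{x}$ and sometimes additional variables $\mathbf{z}$, $\boldsymbol{\theta}$ is a $q \times 1$ vector of parameters, and $\mathbf{m}_i(\cdot)$
is an $h \times 1$ vector. A simple example is that $\mathrm{E}[(y - \mathbf{x}'\boldsymbol{\beta})\mathbf{z}] = \mathbf{0}$ if $\mathbf{z}$ can be omitted in
the linear model $y = \mathbf{x}'\boldsymbol{\beta} + u$. Especially for fully parametric models there are many
candidates for $\mathbf{m}_i(\cdot)$.

An m-test is a test of the closeness to zero of the corresponding sample moment

$$\hat{\mathbf{m}}_N(\hat{\boldsymbol{\theta}}) = N^{-1}\sum_{i=1}^{N} \mathbf{m}_i(\mathbf{w}_i, \hat{\boldsymbol{\theta}}). \tag{8.2}$$

This approach is similar to that for the Wald test, where $\mathbf{h}(\boldsymbol{\theta}) = \mathbf{0}$ is tested by testing
the closeness to zero of $\mathbf{h}(\hat{\boldsymbol{\theta}})$.

A test statistic is obtained by a method similar to that detailed in Section 7.2.4 for
the Wald test. In Section 8.2.3 it is shown that if (8.1) holds then

$$\sqrt{N}\,\hat{\mathbf{m}}_N(\hat{\boldsymbol{\theta}}) \xrightarrow{d} \mathcal{N}[\mathbf{0}, \mathbf{V}_m], \tag{8.3}$$

where $\mathbf{V}_m$, defined later in (8.10), is more complicated than in the case of the Wald test
because $\mathbf{m}_i(\mathbf{w}_i, \hat{\boldsymbol{\theta}})$ has two sources of stochastic variation as both $\mathbf{w}_i$ and $\hat{\boldsymbol{\theta}}$ are random.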

A chi-square test statistic can then be obtained by taking the corresponding
quadratic form. Thus the m-test statistic for (8.1) is

$$\mathrm{M} = N\,\hat{\mathbf{m}}_N(\hat{\boldsymbol{\theta}})'\,\hat{\mathbf{V}}_m^{-1}\,\hat{\mathbf{m}}_N(\hat{\boldsymbol{\theta}}), \tag{8.4}$$

which is asymptotically $\chi^2(\mathrm{rank}[\mathbf{V}_m])$ distributed if the moment conditions (8.1) are
correct. An m-test rejects the moment conditions (8.1) at significance level $\alpha$ if
$\mathrm{M} > \chi^2_{\alpha}(h)$ and does not reject otherwise.

A complication is that $\mathbf{V}_m$ may not be of full rank $h$. For example, this is the case
if the estimator $\hat{\boldsymbol{\theta}}$ itself sets a linear combination of components of $\hat{\mathbf{m}}_N(\hat{\boldsymbol{\theta}})$ to 0. In
some cases, such as the OIR test, $\hat{\mathbf{V}}_m$ is still of full rank and M can be computed but
the chi-square test statistic has only $\mathrm{rank}[\mathbf{V}_m]$ degrees of freedom. In other cases $\hat{\mathbf{V}}_m$
itself is not of full rank. Then it is simplest to drop $(h - \mathrm{rank}[\mathbf{V}_m])$ of the moment
conditions and perform an m-test using just this subset of the moment conditions.
Alternatively, the full set of moment conditions can be used, but $\hat{\mathbf{V}}_m^{-1}$ in (8.4) is replaced
by $\hat{\mathbf{V}}_m^{-}$, the generalized inverse of $\hat{\mathbf{V}}_m$. The Moore-Penrose generalized inverse $\mathbf{V}^{-}$
of a matrix $\mathbf{V}$ satisfies $\mathbf{V}\mathbf{V}^{-}\mathbf{V} = \mathbf{V}$, $\mathbf{V}^{-}\mathbf{V}\mathbf{V}^{-} = \mathbf{V}^{-}$, $(\mathbf{V}\mathbf{V}^{-})' = \mathbf{V}\mathbf{V}^{-}$, and
$(\mathbf{V}^{-}\mathbf{V})' = \mathbf{V}^{-}\mathbf{V}$. When $\mathbf{V}_m$ is less than full rank then strictly speaking (8.3) no longer holds,
since the multivariate normal requires full rank $\mathbf{V}_m$, but (8.4) still holds given these
adjustments.

The m-test approach is conceptually very simple. The moment restriction (8.1) is
rejected if a quadratic form in the sample estimate (8.2) is far enough from zero. The
challenges are in calculating M since $\mathbf{V}_m$ can be quite complex (see Section 8.2.2),
selecting moments $\mathbf{m}(\cdot)$ to test (see Sections 8.2.3-8.2.6 for leading examples), and
interpreting reasons for rejection of (8.1) (see Section 8.2.8).


8.2.2. Computation of the m-Statistic 


There are several ways to compute the m-statistic. 

First, one can always directly compute $\hat{\mathbf{V}}_m$, and hence M, using the consistent
estimates of the components of $\mathbf{V}_m$ given in Section 8.2.3. Most practitioners shy away
from this approach as it entails matrix computations.

Second, the bootstrap can always be used (see Section 11.6.3), since the bootstrap
can provide an estimate of $\mathbf{V}_m$ that controls for all sources of variation in
$\hat{\mathbf{m}}_N(\hat{\boldsymbol{\theta}}) = N^{-1}\sum_i \mathbf{m}_i(\mathbf{w}_i, \hat{\boldsymbol{\theta}})$.

Third, in some cases auxiliary regressions similar to those for the LM test given
in Section 7.3.5 can be run to compute asymptotically equivalent versions of M that
do not require computation of $\hat{\mathbf{V}}_m$. These auxiliary regressions may in turn be
bootstrapped to obtain an asymptotic refinement (see Section 11.6.3). We present several
leading auxiliary regressions.


Auxiliary Regressions Using the ML Estimator 


Model specification tests are especially desirable when inference is done within the 
likelihood framework, as in general any misspecification of the density can lead to in- 
consistency of the MLE. Fortunately, an m-test is easily implemented when estimation 
is by maximum likelihood. 

Specifically, when $\hat{\boldsymbol{\theta}}$ is the MLE, generalizing the LM test result of Section 7.3.5
(see Section 8.2.3) shows that an asymptotically equivalent version of the m-test is obtained
from the auxiliary regression

$$1 = \hat{\mathbf{m}}_i'\boldsymbol{\delta} + \hat{\mathbf{s}}_i'\boldsymbol{\gamma} + u_i, \tag{8.5}$$

where $\hat{\mathbf{m}}_i = \mathbf{m}_i(y_i, \mathbf{x}_i, \hat{\boldsymbol{\theta}}_{\mathrm{ML}})$, $\hat{\mathbf{s}}_i = \partial \ln f(y_i|\mathbf{x}_i, \boldsymbol{\theta})/\partial\boldsymbol{\theta}\,|_{\hat{\boldsymbol{\theta}}_{\mathrm{ML}}}$ is the contribution of the
ith observation to the score and $f(y_i|\mathbf{x}_i, \boldsymbol{\theta})$ is the conditional density function, by
calculating

$$\mathrm{M}^* = N R_u^2, \tag{8.6}$$

where $R_u^2$ is the uncentered $R^2$ defined at the end of Section 7.3.5. Equivalently, $\mathrm{M}^*$
equals $\mathrm{ESS}_u$, the uncentered explained sum of squares (the sum of squares of the fitted
values) from regression (8.5), or $\mathrm{M}^*$ equals $N - \mathrm{RSS}$, where RSS is the residual sum
of squares from regression (8.5). $\mathrm{M}^*$ is asymptotically $\chi^2(h)$ under $H_0$.

The test statistic M* is called the outer product of the gradient form of the m-test, 
and it is a generalization of the auxiliary regression for the LM test (see Section 7.3.5). 
Although the OPG form can be easily computed, it has poor small-sample properties 
with large size distortions. Similar to the LM test, however, these small-sample prob- 
lems can be greatly reduced by using bootstrap methods (see Section 11.6.3). 
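
The OPG auxiliary regression is simple to code. The sketch below is an illustration under
assumptions that are not taken from the text: the model is $y \sim \mathcal{N}[\mu, \sigma^2]$ estimated by ML,
and the moment conditions tested are $\mathrm{E}[(y - \mu)^3] = 0$ and $\mathrm{E}[(y - \mu)^4 - 3\sigma^4] = 0$,
so that $h = 2$. The helper names `solve_linear`, `opg_m_test`, and `normality_m_test` are
hypothetical. The statistic is computed as the uncentered explained sum of squares from
regressing 1 on $\hat{\mathbf{m}}_i$ and $\hat{\mathbf{s}}_i$, and would be compared with the $\chi^2(2)$ critical value, 5.991
at level 0.05.

```python
import random

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def opg_m_test(rows):
    """N times the uncentered R^2 from regressing 1 on the columns in `rows`.

    Because the dependent variable is identically 1, the uncentered total
    sum of squares is N, so N * R_u^2 equals the explained sum of squares
    1'Z (Z'Z)^{-1} Z'1, where Z stacks the rows.
    """
    k = len(rows[0])
    ztz = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    zt1 = [sum(r[a] for r in rows) for a in range(k)]
    coef = solve_linear(ztz, zt1)
    return sum(c * t for c, t in zip(coef, zt1))

def normality_m_test(y):
    """OPG m-test of third- and fourth-moment conditions in the normal model."""
    n = len(y)
    mu = sum(y) / n                                   # MLE of the mean
    sig2 = sum((v - mu) ** 2 for v in y) / n          # MLE of the variance
    rows = []
    for v in y:
        e = v - mu
        m1 = e ** 3                                   # third-moment condition
        m2 = e ** 4 - 3.0 * sig2 ** 2                 # fourth-moment condition
        s1 = e / sig2                                 # score with respect to mu
        s2 = -0.5 / sig2 + 0.5 * e * e / sig2 ** 2    # score with respect to sigma^2
        rows.append([m1, m2, s1, s2])
    return opg_m_test(rows)

rng = random.Random(42)
y = [rng.gauss(0.0, 1.0) for _ in range(500)]         # data that satisfy H0
m_star = normality_m_test(y)
print("M* =", m_star, " reject at 5%:", m_star > 5.991)
```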

The test statistic $\mathrm{M}^*$ may also be appropriate in some non-ML settings. The auxiliary
regression is applicable whenever $\mathrm{E}[\partial\mathbf{m}/\partial\boldsymbol{\theta}'] = -\mathrm{E}[\mathbf{m}\mathbf{s}']$ (see Section 8.2.3). By
the generalized IM equality (see Section 5.6.3), this condition holds for the MLE when
expectation is with respect to the specified density $f(\cdot)$. It can also hold under weaker
distributional assumptions in some cases.


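Auxiliary Regressions When $\mathrm{E}[\partial\mathbf{m}/\partial\boldsymbol{\theta}'] = \mathbf{0}$


In some applications $\mathbf{m}_i(\mathbf{w}_i, \boldsymbol{\theta})$ satisfies

$$\mathrm{E}\left[\partial\mathbf{m}_i(\mathbf{w}_i, \boldsymbol{\theta})/\partial\boldsymbol{\theta}'\,\big|_{\boldsymbol{\theta}_0}\right] = \mathbf{0}, \tag{8.7}$$

in addition to (8.1).

Then it can be shown that the asymptotic distribution of $\sqrt{N}\,\hat{\mathbf{m}}_N(\hat{\boldsymbol{\theta}})$ is the same
as that of $\sqrt{N}\,\hat{\mathbf{m}}_N(\boldsymbol{\theta}_0)$, so $\mathbf{V}_m = \mathrm{plim}\, N^{-1}\sum_i \mathbf{m}_{i0}\mathbf{m}_{i0}'$, which can be consistently
estimated by $\hat{\mathbf{V}}_m = N^{-1}\sum_i \hat{\mathbf{m}}_i\hat{\mathbf{m}}_i'$. The test statistic can be computed in a similar manner
to (8.5), except the auxiliary regression is more simply

$$1 = \hat{\mathbf{m}}_i'\boldsymbol{\delta} + u_i, \tag{8.8}$$

with test statistic $\mathrm{M}^{**}$ equal to N times the uncentered $R^2$.

This auxiliary regression is valid for any root-N consistent estimator $\hat{\boldsymbol{\theta}}$, not just
the MLE, provided (8.7) holds. The condition (8.7) is met in a few examples; see
Section 8.2.9 for an example.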

Even if (8.7) does not hold the simpler regression (8.8) might still be run as a guide, 
as it places a lower bound on the correct value of M, the m-test statistic. If this simpler 
regression leads to rejection then (8.1) is certainly rejected. 


Other Auxiliary Regressions 


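Alternative auxiliary regressions to (8.5) and (8.8) are possible if $\mathbf{m}(y, \mathbf{x}, \boldsymbol{\theta})$ and
$\mathbf{s}(y, \mathbf{x}, \boldsymbol{\theta})$ can be appropriately factorized.

First, if $\mathbf{s}(y, \mathbf{x}, \boldsymbol{\theta}) = \mathbf{g}(\mathbf{x}, \boldsymbol{\theta})\,r(y, \mathbf{x}, \boldsymbol{\theta})$ and $\mathbf{m}(y, \mathbf{x}, \boldsymbol{\theta}) = \mathbf{h}(\mathbf{x}, \boldsymbol{\theta})\,r(y, \mathbf{x}, \boldsymbol{\theta})$ for some
common scalar function $r(\cdot)$ with $\mathrm{V}[r(y, \mathbf{x}, \boldsymbol{\theta})] = 1$ and estimation is by ML, then an
asymptotically equivalent regression to (8.5) is $N R_u^2$ from regression of $\hat{r}_i$ on $\hat{\mathbf{g}}_i$ and $\hat{\mathbf{h}}_i$.

Second, if $\mathbf{m}(y, \mathbf{x}, \boldsymbol{\theta}) = \mathbf{h}(\mathbf{x}, \boldsymbol{\theta})\,v(y, \mathbf{x}, \boldsymbol{\theta})$ for some scalar function $v(\cdot)$ with
$\mathrm{V}[v(y, \mathbf{x}, \boldsymbol{\theta})] = 1$ and $\mathrm{E}[\partial\mathbf{m}/\partial\boldsymbol{\theta}'] = \mathbf{0}$, then an asymptotically equivalent regression
to (8.8) is $N R_u^2$ from regression of $\hat{v}_i$ on $\hat{\mathbf{h}}_i$. For further details see Wooldridge (1991).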

Additional auxiliary regressions exist in special settings. Examples are given in 
Section 8.4, and White (1994) gives a quite general treatment. 


8.2.3. Derivations for the m-Test Statistic 


To avoid the need to compute Vm, the variance matrix in (8.3), m-tests are usually 
implemented using auxiliary regressions or bootstrap methods. For completeness this 
section derives the actual expression for Vm and provides justification for the auxiliary 
regressions (8.5) and (8.8). 

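The key is obtaining the distribution of $\hat{\mathbf{m}}_N(\hat{\boldsymbol{\theta}})$ defined in (8.2). This is complicated
because $\hat{\mathbf{m}}_N(\hat{\boldsymbol{\theta}})$ is stochastic for two reasons: the random variables $\mathbf{w}_i$ and evaluation
at the estimator $\hat{\boldsymbol{\theta}}$.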

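Assume that $\hat{\boldsymbol{\theta}}$ is an m-estimator or estimating equations estimator that solves

$$\frac{1}{N}\sum_{i=1}^{N} \mathbf{s}_i(\mathbf{w}_i, \hat{\boldsymbol{\theta}}) = \mathbf{0}, \tag{8.9}$$

for some function $\mathbf{s}(\cdot)$, here not necessarily $\partial \ln f(y|\mathbf{x}, \boldsymbol{\theta})/\partial\boldsymbol{\theta}$, and make the usual
cross-section assumption that data are independent over i. Then we shall show that
$\sqrt{N}\,\hat{\mathbf{m}}_N(\hat{\boldsymbol{\theta}}) \xrightarrow{d} \mathcal{N}[\mathbf{0}, \mathbf{V}_m]$, as in (8.3), where

$$\mathbf{V}_m = \mathbf{H}_0\mathbf{J}_0\mathbf{H}_0', \tag{8.10}$$

the $h \times (h + q)$ matrix

$$\mathbf{H}_0 = [\mathbf{I}_h \;\; -\mathbf{C}_0\mathbf{A}_0^{-1}], \tag{8.11}$$

where $\mathbf{C}_0 = \mathrm{plim}\, N^{-1}\sum_i \partial\mathbf{m}_{i0}/\partial\boldsymbol{\theta}'$ and $\mathbf{A}_0 = \mathrm{plim}\, N^{-1}\sum_i \partial\mathbf{s}_{i0}/\partial\boldsymbol{\theta}'$, and the
$(h + q) \times (h + q)$ matrix

$$\mathbf{J}_0 = \mathrm{plim}\, N^{-1}\begin{bmatrix} \sum_{i=1}^{N}\mathbf{m}_{i0}\mathbf{m}_{i0}' & \sum_{i=1}^{N}\mathbf{m}_{i0}\mathbf{s}_{i0}' \\ \sum_{i=1}^{N}\mathbf{s}_{i0}\mathbf{m}_{i0}' & \sum_{i=1}^{N}\mathbf{s}_{i0}\mathbf{s}_{i0}' \end{bmatrix}, \tag{8.12}$$

where $\mathbf{m}_{i0} = \mathbf{m}_i(\mathbf{w}_i, \boldsymbol{\theta}_0)$ and $\mathbf{s}_{i0} = \mathbf{s}_i(\mathbf{w}_i, \boldsymbol{\theta}_0)$.

To derive (8.10), take a first-order Taylor series expansion around $\boldsymbol{\theta}_0$ to obtain

$$\sqrt{N}\,\hat{\mathbf{m}}_N(\hat{\boldsymbol{\theta}}) = \sqrt{N}\,\hat{\mathbf{m}}_N(\boldsymbol{\theta}_0) + \frac{\partial\hat{\mathbf{m}}_N(\boldsymbol{\theta}_0)}{\partial\boldsymbol{\theta}'}\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0) + o_p(1). \tag{8.13}$$

For $\hat{\boldsymbol{\theta}}$ defined in (8.9) this implies that

$$\sqrt{N}\,\hat{\mathbf{m}}_N(\hat{\boldsymbol{\theta}}) = \frac{1}{\sqrt{N}}\sum_{i=1}^{N}\mathbf{m}_i(\boldsymbol{\theta}_0) - \mathbf{C}_0\mathbf{A}_0^{-1}\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\mathbf{s}_{i0} + o_p(1), \tag{8.14}$$

where we use $\hat{\mathbf{m}}_N = N^{-1}\sum_i \mathbf{m}_i$, $\partial\hat{\mathbf{m}}_N/\partial\boldsymbol{\theta}' = N^{-1}\sum_i \partial\mathbf{m}_i/\partial\boldsymbol{\theta}' \xrightarrow{p} \mathbf{C}_0$, and
$\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)$ has the same limit distribution as $-\mathbf{A}_0^{-1}N^{-1/2}\sum_i \mathbf{s}_{i0}$ by applying the usual
first-order Taylor series expansion to (8.9). Equation (8.14) can be written as

$$\sqrt{N}\,\hat{\mathbf{m}}_N(\hat{\boldsymbol{\theta}}) = [\mathbf{I}_h \;\; -\mathbf{C}_0\mathbf{A}_0^{-1}]\begin{bmatrix} \frac{1}{\sqrt{N}}\sum_{i=1}^{N}\mathbf{m}_{i0} \\ \frac{1}{\sqrt{N}}\sum_{i=1}^{N}\mathbf{s}_{i0} \end{bmatrix} + o_p(1). \tag{8.15}$$

Equation (8.10) follows by application of the limit normal product rule (Theorem A.17)
as the second term in the product in (8.15) has limit normal distribution
under $H_0$ with mean $\mathbf{0}$ and variance $\mathbf{J}_0$.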

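To compute M in (8.4), a consistent estimate $\hat{\mathbf{V}}_m$ for $\mathbf{V}_m$ can be obtained by replacing
each component of $\mathbf{V}_m$ by a consistent estimate. For example, $\mathbf{C}_0$ can be consistently
estimated by $\hat{\mathbf{C}} = N^{-1}\sum_i \partial\mathbf{m}_i/\partial\boldsymbol{\theta}'\,|_{\hat{\boldsymbol{\theta}}}$, and so on. Although this can always be
done, using auxiliary regressions is easier when they are available.

First, consider the auxiliary regression (8.5) when $\hat{\boldsymbol{\theta}}$ is the MLE. By the generalized
IM equality (see Section 5.6.3) $\mathrm{E}[\partial\mathbf{m}_{i0}/\partial\boldsymbol{\theta}'] = -\mathrm{E}[\mathbf{m}_{i0}\mathbf{s}_{i0}']$, where for the MLE we
specialize to $\mathbf{s}_i = \partial \ln f(y_i, \mathbf{x}_i, \boldsymbol{\theta})/\partial\boldsymbol{\theta}$. Considerable simplification occurs since then
$\mathbf{C}_0 = -\mathrm{plim}\, N^{-1}\sum_i \mathbf{m}_{i0}\mathbf{s}_{i0}'$ and $\mathbf{A}_0 = -\mathrm{plim}\, N^{-1}\sum_i \mathbf{s}_{i0}\mathbf{s}_{i0}'$, which also appear in the
$\mathbf{J}_0$ matrix. This leads to the OPG form of the test. For further details see Newey (1985)
or Pagan and Vella (1989).

Second, for the auxiliary regression (8.8), note that if $\mathrm{E}[\partial\mathbf{m}_{i0}/\partial\boldsymbol{\theta}'] = \mathbf{0}$ then
$\mathbf{C}_0 = \mathbf{0}$, so $\mathbf{H}_0 = [\mathbf{I}_h \;\; \mathbf{0}]$ and hence $\mathbf{H}_0\mathbf{J}_0\mathbf{H}_0' = \mathrm{plim}\, N^{-1}\sum_i \mathbf{m}_{i0}\mathbf{m}_{i0}'$.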
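
To see the two simplifications explicitly, it may help to multiply out (8.10). The block
labels $\mathbf{J}_{mm}$, $\mathbf{J}_{ms}$, $\mathbf{J}_{sm}$, and $\mathbf{J}_{ss}$ below are introduced here only for illustration and
denote the four blocks of $\mathbf{J}_0$ in (8.12), with $\mathbf{J}_{sm} = \mathbf{J}_{ms}'$. The ML line uses
$\mathbf{C}_0 = -\mathbf{J}_{ms}$ and $\mathbf{A}_0 = -\mathbf{J}_{ss}$, as stated above, together with symmetry of $\mathbf{J}_{ss}$.

```latex
\begin{align*}
\mathbf{V}_m &= \mathbf{H}_0\mathbf{J}_0\mathbf{H}_0'
  = \begin{bmatrix}\mathbf{I}_h & -\mathbf{C}_0\mathbf{A}_0^{-1}\end{bmatrix}
    \begin{bmatrix}\mathbf{J}_{mm} & \mathbf{J}_{ms}\\ \mathbf{J}_{sm} & \mathbf{J}_{ss}\end{bmatrix}
    \begin{bmatrix}\mathbf{I}_h \\ -(\mathbf{A}_0^{-1})'\mathbf{C}_0'\end{bmatrix} \\
 &= \mathbf{J}_{mm} - \mathbf{C}_0\mathbf{A}_0^{-1}\mathbf{J}_{sm}
    - \mathbf{J}_{ms}(\mathbf{A}_0^{-1})'\mathbf{C}_0'
    + \mathbf{C}_0\mathbf{A}_0^{-1}\mathbf{J}_{ss}(\mathbf{A}_0^{-1})'\mathbf{C}_0' . \\[4pt]
\text{ML case } (\mathbf{C}_0 = -\mathbf{J}_{ms},\ \mathbf{A}_0 = -\mathbf{J}_{ss}):\quad
\mathbf{V}_m &= \mathbf{J}_{mm} - \mathbf{J}_{ms}\mathbf{J}_{ss}^{-1}\mathbf{J}_{sm}
    - \mathbf{J}_{ms}\mathbf{J}_{ss}^{-1}\mathbf{J}_{sm}
    + \mathbf{J}_{ms}\mathbf{J}_{ss}^{-1}\mathbf{J}_{ss}\mathbf{J}_{ss}^{-1}\mathbf{J}_{sm} \\
 &= \mathbf{J}_{mm} - \mathbf{J}_{ms}\mathbf{J}_{ss}^{-1}\mathbf{J}_{sm} . \\[4pt]
\text{Case } \mathbf{C}_0 = \mathbf{0}:\quad
\mathbf{V}_m &= \mathbf{J}_{mm} .
\end{align*}
```

In the ML case the result is the variance of the part of $\mathbf{m}_{i0}$ that is orthogonal to the
score $\mathbf{s}_{i0}$, which is why including $\hat{\mathbf{s}}_i$ as a regressor in (8.5) accounts for the estimation
of $\boldsymbol{\theta}$.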


8.2.4. Conditional Moment Tests 


Conditional moment tests, due to Newey (1985) and Tauchen (1985), are m-tests of 
unconditional moment restrictions that are obtained from an underlying conditional 
moment restriction. 

As an example, consider the linear regression model $y = \mathbf{x}'\boldsymbol{\beta} + u$. A standard
assumption for consistency of the OLS estimator is that the error has conditional mean
zero, or equivalently the conditional moment restriction

$$\mathrm{E}[y - \mathbf{x}'\boldsymbol{\beta}\,|\,\mathbf{x}] = 0. \tag{8.16}$$

In Chapter 6 we considered using some of the implied unconditional moment restrictions
as the basis of MM or GMM estimation. In particular (8.16) implies that
$\mathrm{E}[\mathbf{x}(y - \mathbf{x}'\boldsymbol{\beta})] = \mathbf{0}$. Solving the corresponding sample moment condition
$\sum_i \mathbf{x}_i(y_i - \mathbf{x}_i'\boldsymbol{\beta}) = \mathbf{0}$ leads to the OLS estimator for $\boldsymbol{\beta}$. However, (8.16) implies many
other moment conditions that are not used in estimation. Consider the unconditional moment
restriction

$$\mathrm{E}[\mathbf{g}(\mathbf{x})(y - \mathbf{x}'\boldsymbol{\beta})] = \mathbf{0},$$

where the vector $\mathbf{g}(\mathbf{x})$ should differ from $\mathbf{x}$, already used in OLS estimation. For example,
$\mathbf{g}(\mathbf{x})$ may contain the squares and cross-products of the components of the regressor
vector $\mathbf{x}$. This suggests a test based on whether or not the corresponding sample
moment $\hat{\mathbf{m}}_N(\hat{\boldsymbol{\beta}}) = N^{-1}\sum_i \mathbf{g}(\mathbf{x}_i)(y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}})$ is close to zero.

More generally, consider the conditional moment restriction

$$\mathrm{E}[r(y, \mathbf{x}, \boldsymbol{\theta})\,|\,\mathbf{x}] = 0, \tag{8.17}$$

for some scalar function $r(\cdot)$. The conditional moment (CM) test is an m-test based
on the implied unconditional moment restrictions

$$\mathrm{E}[\mathbf{g}(\mathbf{x})\,r(y, \mathbf{x}, \boldsymbol{\theta})] = \mathbf{0}, \tag{8.18}$$

where $\mathbf{g}(\mathbf{x})$ and/or $r(y, \mathbf{x}, \boldsymbol{\theta})$ are chosen so that these restrictions are not already used
in estimation.

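Likelihood-based models lead to many potential restrictions. For less than fully
parametric models examples of $r(y, \mathbf{x}, \boldsymbol{\theta})$ include $y - \mu(\mathbf{x}, \boldsymbol{\theta})$, where $\mu(\cdot)$ is the
specified conditional mean function, and $(y - \mu(\mathbf{x}, \boldsymbol{\theta}))^2 - \sigma^2(\mathbf{x}, \boldsymbol{\theta})$, where $\sigma^2(\mathbf{x}, \boldsymbol{\theta})$ is a
specified conditional variance function.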


8.2.5. White’s Information Matrix Test 


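For ML estimation the information matrix equality implies moment restrictions that
may be used in an m-test, as they are usually not imposed in obtaining the MLE.
Specifically, from Section 5.6.3 the IM equality implies

$$\mathrm{E}[\mathrm{Vech}[\mathbf{D}_i(y_i, \mathbf{x}_i, \boldsymbol{\theta}_0)]] = \mathbf{0}, \tag{8.19}$$

where the $q \times q$ matrix $\mathbf{D}_i$ is given by

$$\mathbf{D}_i(y_i, \mathbf{x}_i, \boldsymbol{\theta}_0) = \frac{\partial^2 \ln f_i}{\partial\boldsymbol{\theta}\,\partial\boldsymbol{\theta}'} + \frac{\partial \ln f_i}{\partial\boldsymbol{\theta}}\frac{\partial \ln f_i}{\partial\boldsymbol{\theta}'}, \tag{8.20}$$

and the expectation is taken with respect to the assumed conditional density
$f_i = f(y_i|\mathbf{x}_i, \boldsymbol{\theta})$. Here Vech is the vector-half operator that stacks the columns of the
matrix $\mathbf{D}_i$ in the same way as the Vec operator, except that only the $q(q + 1)/2$ unique
elements of the symmetric matrix $\mathbf{D}_i$ are stacked.

White (1982) proposed the information matrix test of whether the corresponding
sample moment

$$\hat{\mathbf{d}}_N(\hat{\boldsymbol{\theta}}) = N^{-1}\sum_{i=1}^{N} \mathrm{Vech}[\mathbf{D}_i(y_i, \mathbf{x}_i, \hat{\boldsymbol{\theta}}_{\mathrm{ML}})] \tag{8.21}$$

is close to zero. Using (8.4) the IM test statistic is

$$\mathrm{IM} = N\,\hat{\mathbf{d}}_N(\hat{\boldsymbol{\theta}})'\,\hat{\mathbf{V}}^{-1}\,\hat{\mathbf{d}}_N(\hat{\boldsymbol{\theta}}), \tag{8.22}$$

where the expression for $\hat{\mathbf{V}}$ given in White (1982) is quite complicated. A much easier
way to implement the test, due to Lancaster (1984) and Chesher (1984), is to use the
auxiliary regression (8.5), which is applicable since the MLE is used in (8.21).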

The IM test can also be applied to a subset of the restrictions in (8.19). This should 
be done if q is large as then the number of restrictions q(q + 1)/2 being tested is very 
large. 

Large values of the IM test statistic lead to rejection of the restrictions of the 
IM equality and the conclusion that the density is incorrectly specified. In general
this means that the ML estimator is inconsistent. In some special cases, detailed in
Section 5.7, the MLE may still be consistent though standard errors need then to be 
based on the sandwich form of the variance matrix. 


8.2.6. Chi-Square Goodness-of-Fit Test 


A useful specification test for fully parametric models is to compare predicted prob- 
abilities with sample relative frequencies. The model is a poor one if these differ 
considerably. 

Begin with discrete iid random variable y that can take one of J possible values
with probabilities $p_1, p_2, \ldots, p_J$, $\sum_{j=1}^{J} p_j = 1$. The correct specification of the
probabilities can be tested by testing the equality of theoretical frequencies $Np_j$ to the
observed frequencies $N\bar{p}_j$, where $\bar{p}_j$ is the fraction of the sample that takes the jth
possible value. The Pearson chi-square goodness-of-fit test (PCGF) statistic is

$$PCGF = \sum_{j=1}^{J} \frac{(N\bar{p}_j - Np_j)^2}{Np_j}. \tag{8.23}$$

This statistic is asymptotically $\chi^2(J - 1)$ distributed under the null hypothesis that the
probabilities $p_1, p_2, \ldots, p_J$ are correct. The test can be extended to permit the
probabilities to be predicted from regression (see Exercise 8.2). Consider a multinomial
model for discrete y with probabilities $p_{ij} = p_j(x_i, \theta)$. Then $p_j$ in (8.23) is replaced
by $\hat{p}_j = N^{-1} \sum_i p_j(x_i, \hat{\theta})$ and if $\hat{\theta}$ is the multinomial MLE we again get a chi-square
distribution, but with reduced number of degrees of freedom $(J - \dim(\theta) - 1)$ resulting
from the estimation of $\theta$ (see Andrews, 1988a).
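
As a small illustration of (8.23), the following Python sketch computes the Pearson statistic from observed cell counts and fully specified cell probabilities. The function name `pearson_gof` and the numbers are hypothetical and chosen only for illustration; the sketch does not adjust the degrees of freedom for estimated parameters.

```python
def pearson_gof(counts, probs):
    """Pearson chi-square goodness-of-fit statistic, as in (8.23).

    counts: observed cell counts N * pbar_j (hypothetical input name)
    probs:  hypothesized cell probabilities p_j, summing to one
    """
    if len(counts) != len(probs):
        raise ValueError("counts and probs must have the same length")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to one")
    n = sum(counts)
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs))


# Illustrative data (not from the text): 100 draws over two cells.
stat = pearson_gof([40, 60], [0.5, 0.5])
dof = 2 - 1  # J - 1 when the probabilities are fully specified
print(stat, dof)
```

The statistic would then be compared with a chi-square critical value with J - 1 degrees of freedom.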

For regression models other than multinomial models, the statistic PCGF in (8.23) 
can be computed by grouping y into cells, but the statistic PCGF is then no longer 
chi-square distributed. Instead, a closely related m-test statistic is used. To derive this 
statistic, break the range of y into J mutually exclusive cells, where the J cells span 
all possible values of y. Let $d_{ij}(y_i)$ be an indicator variable equal to one if $y_i \in$ cell
j and equal to zero otherwise. Let $p_{ij}(x_i, \theta) = \int_{y_i \in \text{cell } j} f(y_i|x_i, \theta)\,dy_i$ be the predicted
probability that observation i falls in cell j, where $f(y|x, \theta)$ is the conditional density
of y and to begin with we assume the parameter vector $\theta$ is known. If the conditional
density is correctly specified, then

$$E[d_{ij}(y_i) - p_{ij}(x_i, \theta)] = 0, \quad j = 1, \ldots, J. \tag{8.24}$$

Stacking all J moments in obvious vector notation, we have

$$E[d_i(y_i) - p_i(x_i, \theta)] = 0, \tag{8.25}$$

where $d_i$ and $p_i$ are $J \times 1$ vectors with jth entries $d_{ij}$ and $p_{ij}$. This suggests an m-test
of the closeness to zero of the corresponding sample moment

$$\widehat{dp}_N(\hat{\theta}) = N^{-1} \sum_{i=1}^{N} (d_i(y_i) - p_i(x_i, \hat{\theta})), \tag{8.26}$$

which is the difference between the vector of sample relative frequencies $N^{-1} \sum_i d_i$
and the vector of predicted frequencies $N^{-1} \sum_i \hat{p}_i$. Using (8.5) we obtain the
chi-square goodness-of-fit (CGF) test statistic of Andrews (1988a, 1988b):

$$CGF = N\,\widehat{dp}_N(\hat{\theta})'\,\hat{V}^{-1}\,\widehat{dp}_N(\hat{\theta}), \tag{8.27}$$

where the expression for $\hat{V}$ is quite complicated. The CGF test statistic is easily
computed using the auxiliary regression (8.5), with $\hat{m}_i = d_i - \hat{p}_i$. This auxiliary regression
is appropriate here because a fully parametric model is being tested and so $\hat{\theta}$ will be
the MLE.

One of the categories needs to be dropped because of the restriction that probabilities
sum to one, yielding a test statistic that is asymptotically $\chi^2(J - 1)$ under the
null hypothesis that $f(y|x, \theta)$ is correctly specified. Further categories may need to
be dropped in some special cases, such as the multinomial example already discussed
after (8.23). In addition to reporting the calculated test statistic it can be informative to
report the components of $N^{-1} \sum_i d_i$ and $N^{-1} \sum_i \hat{p}_i$.

The relevant asymptotic theory is provided by Andrews (1988a), with a simpler 
presentation and several applications given in Andrews (1988b). For simplicity we 
presented cells determined by the range of y, but the partitioning can be on both y 
and x. Cells should be chosen so that no cell has only a few observations. For further 
details and a history of this test see these articles. 

For continuous random variable y in the iid case a more general test than the CGF
test is the Kolmogorov test; this uses the entire distribution of y, not just cells formed
from y. Andrews (1997) presents a regression version of the Kolmogorov test, but it is
much more difficult to implement than the CGF test.


8.2.7. Test of Overidentifying Restrictions 


Tests of overidentifying assumptions (see Section 6.3.8) are examples of m-tests. 

In the notation of Chapter 6, the GMM estimator is based on the assumption that
$E[h(w_i, \theta_0)] = 0$. If the model is overidentified, then only q of these moment
restrictions are used in estimation, leading to $(r - q)$ linearly independent orthogonality
conditions, where $r = \dim[h(\cdot)]$, that can be used to form an m-test. Then we
use M in (8.4), where $\hat{m}_N = N^{-1} \sum_i h(w_i, \hat{\theta})$. As shown in Section 6.3.9, if $\hat{\theta}$ is
the optimal GMM estimator then $N\,\hat{m}_N(\hat{\theta})'\,\hat{S}_N^{-1}\,\hat{m}_N(\hat{\theta})$, where $\hat{S}_N = N^{-1} \sum_{i=1}^{N} \hat{h}_i \hat{h}_i'$, is
asymptotically $\chi^2(r - q)$ distributed. A more intuitive linear IV example is given in
Section 8.4.4.


8.2.8. Power and Consistency of Conditional Moment Tests 


Because there is no explicit alternative hypothesis, m-tests differ from the tests of 
Chapter 7. 

Several authors have given examples where the IM test can be shown to be equiv- 
alent to a conventional LM test of null against alternative hypotheses. Chesher (1984) 
interpreted the IM test as a test for random parameter heterogeneity. For the linear 
model under normality, A. Hall (1987) showed that subcomponents of the IM test 
correspond to LM tests of heteroskedasticity, symmetry, and kurtosis. Cameron and
Trivedi (1998) give some additional examples and reference to results for the linear
exponential family.

More generally, m-tests can be interpreted in a conditional moment framework 
as follows. Begin with an added variable test in a linear regression model. Suppose 
we want to test whether $\beta_2 = 0$ in the model $y = x_1'\beta_1 + x_2'\beta_2 + u$. This is a test of
$H_0: E[y - x_1'\beta_1|x] = 0$ against $H_a: E[y - x_1'\beta_1|x] = x_2'\beta_2$. The most powerful test
of $H_0: \beta_2 = 0$ in regression of $y - x_1'\beta_1$ on $x_2$ is based on the efficient GLS estimator

$$\hat{\beta}_2 = \left[\sum_{i=1}^{N} \frac{x_{2i} x_{2i}'}{\sigma_i^2}\right]^{-1} \sum_{i=1}^{N} \frac{x_{2i}(y_i - x_{1i}'\beta_1)}{\sigma_i^2},$$

where $\sigma_i^2 = V[y_i|x_i]$ under $H_0$ and independence over i is assumed. This test is
equivalent to a test based on the second sum alone, which is an m-test of

$$E\left[\frac{x_{2i}(y_i - x_{1i}'\beta_1)}{\sigma_i^2}\right] = 0. \tag{8.28}$$

Reversing the process, we can interpret an m-test based on (8.28) as a CM test of
$H_0: E[y - x_1'\beta_1|x] = 0$ against $H_a: E[y - x_1'\beta_1|x] = x_2'\beta_2$. Also, an m-test based
on $E[x_2(y - x_1'\beta_1)] = 0$ can be interpreted as a CM test of $H_0: E[y - x_1'\beta_1|x] = 0$
against $H_a: E[y - x_1'\beta_1|x] = \sigma^2_{y|x}\,x_2'\beta_2$, where $\sigma^2_{y|x} = V[y|x]$ under $H_0$.

More generally, suppose we start with the conditional moment restriction 


$$E[r(y_i, x_i, \theta)|x_i] = 0, \tag{8.29}$$

for some scalar function $r(\cdot)$. Then an m-test based on the unconditional moment
restriction

$$E[g(x_i)\,r(y_i, x_i, \theta)] = 0 \tag{8.30}$$

can be interpreted as a CM test with null and alternative hypotheses

$$H_0: E[r(y_i, x_i, \theta)|x_i] = 0, \qquad H_a: E[r(y_i, x_i, \theta)|x_i] = \sigma_i^2\,g(x_i)'\gamma, \tag{8.31}$$

where $\sigma_i^2 = V[r(y_i, x_i, \theta)|x_i]$ under $H_0$.

This approach gives a guide to the directions in which a CM test has power. Al- 
though (8.30) suggests power is in the general direction of g(x), from (8.31) a more 
precise statement is that it is instead the direction of g(x) multiplied by the variance 
of $r(y, x, \theta)$. The distinction is important because in many cross-section applications this
variance is not constant across observations. For further details and references see
Cameron and Trivedi (1998), who call this a regression-based CM test. The approach 
generalizes to vector r(-), though with more cumbersome algebra. 

An m-test is a test of a finite number of moment conditions. It is therefore possible to 
construct a dgp for which the underlying conditional moment condition, such as that in 
(8.29), is false yet the moment conditions are satisfied. Then the CM test is inconsistent 
as it does not reject with probability one as $N \to \infty$. Bierens (1990) proposed a way
to specify $g(x)$ in (8.30) that ensures a consistent conditional moment test, for tests
of functional form in the nonlinear regression model where $r(y, x, \theta) = y - f(x, \theta)$.
Ensuring the consistency of the test does not, however, ensure that it will have high
power against particular alternatives.
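
The possibility of an inconsistent CM test can be illustrated by simulation. The sketch below assumes a hypothetical dgp, not taken from the text, in which the null $E[y|x] = 0$ is false because $E[y|x] = x^2 - 1$ with x standard normal. With $r = y$, the unconditional moments based on $g(x) = 1$ and $g(x) = x$ are still zero in the population, whereas $g(x) = x^2$ gives a nonzero moment.

```python
import random


def sample_moment(g, n=20000, seed=0):
    """Average of g(x) * y over a simulated sample, where r = y under H0: E[y|x] = 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        # Hypothetical dgp: E[y|x] = x^2 - 1, so the conditional moment is violated.
        y = x * x - 1.0 + rng.gauss(0.0, 1.0)
        total += g(x) * y
    return total / n


m_const = sample_moment(lambda x: 1.0)     # population value E[x^2 - 1] = 0
m_linear = sample_moment(lambda x: x)      # population value E[x^3 - x] = 0
m_square = sample_moment(lambda x: x * x)  # population value E[x^4 - x^2] = 2
print(m_const, m_linear, m_square)
```

An m-test that uses only the first two choices of g(x) would have no power against this alternative, however large the sample.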


8.2.9. m-Tests Example 


To illustrate various m-tests we consider the Poisson regression model introduced in
Section 5.2, with Poisson density $f(y) = e^{-\mu}\mu^y/y!$ and $\mu = \exp(x'\beta)$.
We wish to test

$$H_0: E[m(y, x, \beta)] = 0,$$

for various choices of $m(\cdot)$. This test will be conducted under the assumption that the
dgp is indeed the specified Poisson density.


Auxiliary Regressions 


Since estimation is by ML we can use the m-test statistic M* computed as N times the
uncentered $R^2$ from auxiliary regression (8.5), where

$$1 = m(y_i, x_i, \hat{\beta})'\delta + (y_i - \exp(x_i'\hat{\beta}))\,x_i'\gamma + u_i, \tag{8.32}$$

since $s = \partial \ln f(y)/\partial\beta = (y - \exp(x'\beta))x$ and $\hat{\beta}$ is the MLE. Under $H_0$ the test is
$\chi^2(\dim(m))$ distributed.

An alternative is the M** statistic from auxiliary regression

$$1 = m(y_i, x_i, \hat{\beta})'\delta + u_i. \tag{8.33}$$

This test is asymptotically equivalent to M* if $m(\cdot)$ is such that $E[\partial m/\partial\beta'] = 0$, but
otherwise it is not chi-squared distributed.


Moments Tested 


Correct specification of the conditional mean function, that is, $E[y - \exp(x'\beta)|x] = 0$,
can be tested by an m-test of

$$E[(y - \exp(x'\beta))z] = 0,$$

where z may be a function of x. For the Poisson and other LEF models, z cannot
equal x because the first-order conditions for $\hat{\beta}_{ML}$ impose the restriction that
$\sum_i (y_i - \exp(x_i'\hat{\beta}))x_i = 0$, leading to M = 0 if z = x. Instead, z could include squares and
cross-products of the regressors.

Correct specification of the variance may also be tested, as the Poisson distribution
implies conditional mean-variance equality. Since $V[y|x] - E[y|x] = 0$, with
$E[y|x] = \exp(x'\beta)$, this suggests an m-test of

$$E[\{(y - \exp(x'\beta))^2 - \exp(x'\beta)\}x] = 0.$$

A variation instead tests

$$E[\{(y - \exp(x'\beta))^2 - y\}x] = 0,$$

as $E[y|x] = \exp(x'\beta)$. Then $m(\beta) = \{(y - \exp(x'\beta))^2 - y\}x$ has the property that
$E[\partial m/\partial\beta'] = 0$, so (8.7) holds and the alternative regression (8.33) yields an
asymptotically equivalent test to the regression (8.32).

A standard specification test for parametric models is the IM test. For the Poisson
density, D defined in (8.19) becomes $D(y, x, \beta) = \{(y - \exp(x'\beta))^2 - y\}xx'$, and we
test

$$E[\{(y - \exp(x'\beta))^2 - y\}\,\mathrm{Vech}[xx']] = 0.$$

Clearly for the Poisson example the IM test is a test of the first and second moment
conditions implied by the Poisson model, a result that holds more generally for LEF
models. The test statistic M** is asymptotically equivalent to M* since here $E[\partial m/\partial\beta'] = 0$.

The Poisson assumption can also be tested using a chi-square goodness-of-fit test.
For example, since few counts exceed three in the subsequent simulation example,
form four cells corresponding to y = 0, 1, 2, and 3 or more, where in implementing
the test the cell with y = 3 or more is dropped because probabilities sum to one.
So for $j = 0, \ldots, 2$ compute indicator $d_{ij} = 1$ if $y_i = j$ and $d_{ij} = 0$ otherwise and
compute predicted probability $\hat{p}_{ij} = e^{-\hat{\mu}_i}\hat{\mu}_i^j/j!$, where $\hat{\mu}_i = \exp(x_i'\hat{\beta})$. Then test

$$E[(d - p)] = 0,$$

where $d_i = [d_{i0}, d_{i1}, d_{i2}]'$ and $p_i = [p_{i0}, p_{i1}, p_{i2}]'$ by the auxiliary regression (8.32)
where $\hat{m}_i = d_i - \hat{p}_i$.


Simulation Results 


Data were generated from a Poisson model with mean $E[y|x] = \exp(\beta_1 + \beta_2 x_2)$,
where $x_2 \sim N[0, 1]$ and $(\beta_1, \beta_2) = (0, 1)$. Poisson ML regression of y on x for a
sample of size 200 yielded

$$\hat{E}[y|x] = \exp(\underset{(0.089)}{-0.165} + \underset{(\ldots)}{1.124}\,x_2),$$

where associated standard errors are in parentheses.

The results of the various m-tests are given in Table 8.1.


Table 8.1. Specification m-Tests for Poisson Regression Example^a

Test Type               $H_0$, where $\mu = \exp(x'\beta)$                  M*     dof   p-value   M**
1. Correct mean         $E[(y - \mu)x_2^2] = 0$                              3.27   1     0.07      0.44
2. Variance = mean      $E[\{(y - \mu)^2 - \mu\}x] = 0$                      2.43   2     0.30      1.89
3. Variance = mean      $E[\{(y - \mu)^2 - y\}x] = 0$                        2.43   2     0.30      2.41
4. Information Matrix   $E[\{(y - \mu)^2 - y\}\mathrm{Vech}[xx']] = 0$       2.95   3     0.40      2.73
5. Chi-square GOF       $E[d - p] = 0$                                       2.50   3     0.48      0.75

^a The dgp for y is the Poisson distribution with mean parameter $\exp(0 + x_2)$ and sample size N = 200. The
m-test statistic M* is chi-squared with degrees of freedom given in the dof column and p-value given in the
p-value column. The alternative test statistic M** is valid for tests 3 and 4 only.


As an example of computation of M* using (8.32) consider the IM test. Since
$x = [1, x_2]'$ and $\mathrm{Vech}[xx'] = [1, x_2, x_2^2]'$, the auxiliary regression is of 1 on $\{(y - \hat{\mu})^2 - y\}$,
$\{(y - \hat{\mu})^2 - y\}x_2$, $\{(y - \hat{\mu})^2 - y\}x_2^2$, $(y - \hat{\mu})$, and $(y - \hat{\mu})x_2$ and yields uncentered
$R^2 = 0.01473$ and N = 200, leading to M* = 2.95. The same value of M* is obtained
directly from the uncentered explained sum of squares of 2.95, and indirectly as N
minus 197.05, the residual sum of squares from this regression. The test statistic is
$\chi^2(3)$ distributed with p = 0.40, so the null hypothesis is not rejected at significance
level 0.05.
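
The computation just described can be sketched in Python using only the standard library. This is an illustrative reimplementation under stated assumptions, not the code behind Table 8.1: the seed, the helper names (`solve`, `n_r2_of_ones`, `poisson_draw`, `poisson_mle`), the Newton-Raphson starting values, and the uniform-product Poisson generator are all choices made here, so the simulated statistics will differ from the tabulated ones.

```python
import math
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def n_r2_of_ones(rows):
    """N times the uncentered R^2 from OLS of 1 on the regressors in rows."""
    k = len(rows[0])
    wtw = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    wt1 = [sum(r[i] for r in rows) for i in range(k)]
    # The total sum of squares of a vector of ones is N, so N * R^2 equals
    # the explained sum of squares 1'W (W'W)^{-1} W'1.
    return sum(c * s for c, s in zip(solve(wtw, wt1), wt1))


def poisson_draw(mu, rng):
    """Draw a Poisson(mu) variate by multiplying uniforms (Knuth's method)."""
    limit, k, prod = math.exp(-mu), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def poisson_mle(x2, y, iters=30):
    """Newton-Raphson for the Poisson MLE with mean exp(b1 + b2 * x2)."""
    b = [math.log(sum(y) / len(y)), 0.0]
    for _ in range(iters):
        g = [0.0, 0.0]
        h = [[0.0, 0.0], [0.0, 0.0]]
        for xi, yi in zip(x2, y):
            mu, xv = math.exp(b[0] + b[1] * xi), (1.0, xi)
            for i in range(2):
                g[i] += (yi - mu) * xv[i]
                for j in range(2):
                    h[i][j] += mu * xv[i] * xv[j]
        step = solve(h, g)
        b = [b[0] + step[0], b[1] + step[1]]
    return b


rng = random.Random(12345)
n = 200
x2 = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [poisson_draw(math.exp(x), rng) for x in x2]
b = poisson_mle(x2, y)

m_rows, full_rows = [], []
for xi, yi in zip(x2, y):
    mu = math.exp(b[0] + b[1] * xi)
    c = (yi - mu) ** 2 - yi
    m = [c, c * xi, c * xi * xi]     # {(y - mu)^2 - y} Vech[xx']
    s = [yi - mu, (yi - mu) * xi]    # Poisson scores (y - mu) x
    m_rows.append(m)
    full_rows.append(m + s)

m_star = n_r2_of_ones(full_rows)     # regression (8.32): moments and scores
m_2star = n_r2_of_ones(m_rows)       # regression (8.33): moments only
print(round(m_star, 2), round(m_2star, 2))
```

Because regression (8.33) uses a subset of the regressors in (8.32), its explained sum of squares cannot exceed that of (8.32), so M** computed this way is never larger than M*.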

For the chi-square goodness-of-fit test the actual frequencies are, respectively,
0.435, 0.255, and 0.110; and the corresponding predicted frequencies are 0.429, 0.241,
and 0.124. This yields PCGF = 0.47 using (8.23), but this statistic is not chi-squared
as it does not control for error in estimating $\beta$. The auxiliary regression for the correct
statistic CGF in (8.27) leads to M* = 2.50, which is chi-square distributed.

In this simulation none of the five moment conditions is rejected at level 0.05 since
the p-value for M* exceeds 0.05 in each case. This is as expected, as the data in this
simulation example are generated from the specified density so that tests at level 0.05 should
reject only 5% of the time. The alternative statistic M** is valid only for tests 3 and
4 since only then does $E[\partial m/\partial\beta'] = 0$; otherwise, it only provides a lower bound
for M*.


8.3. Hausman Test 


Tests based on comparisons between two different estimators are called Hausman tests,
after Hausman (1978), or Wu-Hausman tests or even Durbin-Wu-Hausman tests after
Wu (1973) and Durbin (1954) who proposed similar tests.


8.3.1. Hausman Test 


Consider a test for endogeneity of a regressor in a single equation. Two alternative es- 
timators are the OLS and 2SLS estimators, where the 2SLS estimator uses instruments 
to control for possible endogeneity of the regressor. If there is endogeneity then OLS
is inconsistent, so the two estimators will have different probability limits. If there is no
endogeneity both estimators are consistent, so the two estimators have the same
probability limit. This suggests testing for endogeneity by testing for a difference between
the OLS and 2SLS estimators; see Section 8.4.3 for further discussion.

More generally, consider two estimators $\hat{\theta}$ and $\tilde{\theta}$. We consider the testing situation
where

$$H_0: \mathrm{plim}(\hat{\theta} - \tilde{\theta}) = 0, \qquad H_a: \mathrm{plim}(\hat{\theta} - \tilde{\theta}) \neq 0. \tag{8.34}$$

Assume the difference between the two root-N consistent estimators is also root-N
consistent under $H_0$ with mean 0 and a limit normal distribution, so that

$$\sqrt{N}(\hat{\theta} - \tilde{\theta}) \xrightarrow{d} N[0, V_H],$$

where $V_H$ denotes the variance matrix in the limiting distribution. Then the Hausman
test statistic

$$H = (\hat{\theta} - \tilde{\theta})'\,(N^{-1}\hat{V}_H)^{-1}\,(\hat{\theta} - \tilde{\theta}) \tag{8.35}$$

is asymptotically $\chi^2(q)$ distributed under $H_0$. We reject $H_0$ at level $\alpha$ if $H > \chi^2_\alpha(q)$.

In some applications, such as tests of endogeneity, $V[\hat{\theta} - \tilde{\theta}]$ is of less than full rank.
Then the generalized inverse is used in (8.35) and the chi-square test has degrees of
freedom equal to the rank of $V[\hat{\theta} - \tilde{\theta}]$.

The Hausman test can be applied to just a subset of the parameters. For example, 
interest may lie solely in the coefficient of the possibly endogenous regressor and 
whether it changes in moving from OLS to 2SLS. Then just one component of $\theta$ is
used and the test statistic is $\chi^2(1)$ distributed. As in other settings, this test on a subset
of parameters can lead to a conclusion different from that of a test on all parameters.


8.3.2. Computation of the Hausman Test 


Computing the Hausman test is easy in principle but difficult in practice owing to the
need to obtain a consistent estimate of $V_H$, the limit variance matrix of $\sqrt{N}(\hat{\theta} - \tilde{\theta})$. In
general

$$N^{-1}V_H = V[\hat{\theta} - \tilde{\theta}] = V[\hat{\theta}] + V[\tilde{\theta}] - 2\,\mathrm{Cov}[\hat{\theta}, \tilde{\theta}]. \tag{8.36}$$

The first two quantities are readily computed from the usual output, but the third is
not.


Computation for Fully Efficient Estimator under the Null Hypothesis 


Although the essential null and alternative hypotheses of the Hausman test are as in 
(8.34), in applications there is usually a specific null hypothesis model and alternative 
hypothesis in mind. For example, in comparing OLS and 2SLS estimators the null
hypothesis model has all regressors exogenous whereas the alternative hypothesis model
permits some regressors to be endogenous.

If $\hat{\theta}$ is the efficient estimator in the null hypothesis model, then $\mathrm{Cov}[\hat{\theta}, \tilde{\theta}] = V[\hat{\theta}]$.
For proof see Exercise 8.3. This implies $V[\hat{\theta} - \tilde{\theta}] = V[\tilde{\theta}] - V[\hat{\theta}]$, so

$$H = (\hat{\theta} - \tilde{\theta})'\,(\hat{V}[\tilde{\theta}] - \hat{V}[\hat{\theta}])^{-1}\,(\hat{\theta} - \tilde{\theta}). \tag{8.37}$$

This statistic has the considerable advantage of requiring only the estimated asymptotic
variance matrices of the parameter estimates $\hat{\theta}$ and $\tilde{\theta}$. It is helpful to use a program
that permits saving parameter and variance matrix estimates and computation using
matrix commands.

For example, this simplification can be applied to endogeneity tests in a linear
regression model if the errors are assumed to be homoskedastic. Then $\hat{\theta}$ is the OLS
estimator that is fully efficient under the null hypothesis of no endogeneity, and $\tilde{\theta}$ is
the 2SLS estimator. Care is needed, however, to ensure the consistent estimates of the
variance matrices are such that $\hat{V}[\tilde{\theta}] - \hat{V}[\hat{\theta}]$ is positive definite (see Ruud, 1984). In
the OLS-2SLS comparison the variance matrix estimators $\hat{V}[\hat{\theta}]$ and $\hat{V}[\tilde{\theta}]$ should use
the same estimate of the error variance $\sigma^2$.

Version (8.37) of the Hausman test is especially easy to calculate by hand if $\theta$ is a
scalar, or if only one component of the parameter vector is tested. Then

$$H = (\hat{\theta} - \tilde{\theta})^2/(\tilde{s}^2 - \hat{s}^2)$$

is $\chi^2(1)$ distributed, where $\hat{s}$ and $\tilde{s}$ are the reported standard errors of $\hat{\theta}$ and $\tilde{\theta}$.
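
The scalar calculation can be sketched in a few lines of Python. The function name `hausman_scalar` and the numerical inputs are hypothetical; the sketch assumes the first estimator is fully efficient under the null hypothesis, so that the variance of the difference is the difference of the two variances, and it uses the fact that a chi-square(1) tail probability equals erfc(sqrt(H/2)).

```python
import math


def hausman_scalar(theta_eff, theta_alt, se_eff, se_alt):
    """Scalar Hausman statistic when theta_eff is efficient under H0.

    Returns (H, p_value), with H compared against chi-square(1).
    """
    var_diff = se_alt ** 2 - se_eff ** 2
    if var_diff <= 0:
        raise ValueError("se_alt must exceed se_eff for the statistic to be defined")
    h = (theta_eff - theta_alt) ** 2 / var_diff
    # For one degree of freedom, P[chi2(1) > h] = erfc(sqrt(h / 2)).
    return h, math.erfc(math.sqrt(h / 2.0))


# Illustrative numbers (not from the text): OLS 1.0 (0.1) versus 2SLS 1.3 (0.2).
h, p = hausman_scalar(1.0, 1.3, 0.1, 0.2)
print(round(h, 3), round(p, 4))
```

The explicit check on the variance difference reflects the caution above that the estimated difference need not be positive in finite samples.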


Auxiliary Regressions 


In some leading cases the Hausman test can be more simply computed as a standard 
test for the significance of a subset of regressors in an augmented OLS regression, 
derived under the assumption that $\hat{\theta}$ is fully efficient. Examples are given in Section
8.4.3 and in Section 21.4.3. 


Robust Hausman Tests 


The simpler version (8.37) of the Hausman test, and standard auxiliary regressions,
require the strong distributional assumption that $\hat{\theta}$ is fully efficient. This is counter
to the approach of performing robust inference under relatively weak distributional
assumptions.

Direct estimation of $\mathrm{Cov}[\hat{\theta}, \tilde{\theta}]$ and hence $V_H$ is in principle possible. Suppose $\hat{\theta}$ and
$\tilde{\theta}$ are m-estimators that solve $\sum_i h_{1i}(\hat{\theta}) = 0$ and $\sum_i h_{2i}(\tilde{\theta}) = 0$. Define $\hat{\delta} = [\hat{\theta}', \tilde{\theta}']'$.
Then $V[\hat{\delta}] = G_0^{-1} S_0 (G_0^{-1})'$, where $G_0$ and $S_0$ are defined in Section 6.6, with the
simplification that here $G_{12} = 0$. The desired $V[\hat{\theta} - \tilde{\theta}] = R\,V[\hat{\delta}]\,R'$, where $R = [I_q, -I_q]$.
Implementation can require additional coding that may be application specific.

A simpler approach is to bootstrap (see Section 11.6.3), though care is needed in 
some applications to ensure use of the correct degrees of freedom in the chi-square 
test. 

Another possible approach for less than fully efficient $\hat{\theta}$ is to use an auxiliary
regression that is appropriate in the efficient case but to perform the subsets of regressors
test using robust standard errors. This robust test is simple to implement and will
have power in testing the misspecification of interest, though it may not necessarily be
equivalent to the Hausman test that uses the more general form of H given in (8.35).
An example is given in Section 21.4.3.

Finally, bounds can be calculated that do not require computation of $\mathrm{Cov}[\hat{\theta}, \tilde{\theta}]$. For
scalar random variables, $\mathrm{Cov}[x, y] \leq s_x s_y$. For the scalar case this suggests an upper
bound for H of $N(\hat{\theta} - \tilde{\theta})^2/(\hat{s}^2 + \tilde{s}^2 - 2\hat{s}\tilde{s})$, where $\hat{s}^2 = \hat{V}[\hat{\theta}]$ and $\tilde{s}^2 = \hat{V}[\tilde{\theta}]$. A lower
bound for H is $N(\hat{\theta} - \tilde{\theta})^2/(\hat{s}^2 + \tilde{s}^2)$, under the assumption that $\hat{\theta}$ and $\tilde{\theta}$ are positively
correlated. In practice, however, these bounds are quite wide.
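
A minimal sketch of the bounds for a scalar parameter follows. It assumes, as a simplification made here rather than in the text, that the inputs are the reported standard errors of the two estimators, so that the factor N is already absorbed in them; the function name `hausman_bounds` and the numbers are hypothetical.

```python
def hausman_bounds(theta_1, theta_2, se_1, se_2):
    """Lower and upper bounds on a scalar Hausman statistic.

    se_1 and se_2 are assumed to be the reported standard errors of the two
    estimators, so the factor N in the text is already absorbed in them.
    """
    d2 = (theta_1 - theta_2) ** 2
    lower = d2 / (se_1 ** 2 + se_2 ** 2)   # covariance set to zero
    gap = (se_1 - se_2) ** 2               # s1^2 + s2^2 - 2 * s1 * s2
    upper = float("inf") if gap == 0 else d2 / gap
    return lower, upper


# Illustrative numbers (not from the text).
lo, hi = hausman_bounds(1.0, 1.3, 0.1, 0.2)
print(lo, hi)
```

With these inputs the interval runs from roughly 1.8 to roughly 9, which straddles the 5% critical value of 3.84 and illustrates why such bounds are often too wide to be decisive.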


8.3.3. Power of the Hausman Test 


The Hausman test is a quite general procedure that does not explicitly state an
alternative hypothesis and therefore need not have high power against particular alternatives.
For example, consider tests of exclusion restrictions in fully parametric models.
Denote the null hypothesis $H_0: \theta_2 = 0$, where $\theta$ is partitioned as $(\theta_1', \theta_2')'$. An obvious
specification test is a Hausman test of the difference $\hat{\theta}_1 - \tilde{\theta}_1$, where $(\hat{\theta}_1, \hat{\theta}_2)$ is the
unrestricted MLE and $(\tilde{\theta}_1, 0)$ is the restricted MLE of $\theta$. Holly (1982) showed that this
Hausman test coincides with a classical test (Wald, LR, or LM) of $H_0: \mathcal{I}_{11}^{-1}\mathcal{I}_{12}\theta_2 = 0$,
where $\mathcal{I}_{ij} = E[\partial^2 L(\theta_1, \theta_2)/\partial\theta_i\partial\theta_j']$, rather than of $H_0: \theta_2 = 0$. The two tests
coincide if $\mathcal{I}_{12}$ is of full column rank and $\dim(\theta_1) \geq \dim(\theta_2)$, as then $\mathcal{I}_{11}^{-1}\mathcal{I}_{12}\theta_2 = 0$
iff $\theta_2 = 0$. Otherwise, they can differ. Clearly, the Hausman test will have no power
to detect departures from $H_0: \theta_2 = 0$ if the information matrix is block diagonal, as
then $\mathcal{I}_{12} = 0$. Holly (1987) extended analysis to nonlinear hypotheses.


8.4. Tests for Some Common Misspecifications 


In this section we present tests for some common model misspecifications. Attention 
is focused on test statistics that can be computed using auxiliary regressions, using 
minimal assumptions to permit inference robust to heteroskedastic errors. 


8.4.1. Tests for Omitted Variables 


Omitted variables usually lead to inconsistent parameter estimates, except for special 
cases such as an omitted regressor in the linear model that is uncorrelated with the 
other regressors. It is therefore important to test for potential omitted variables. 

The Wald test is most often used as it is usually no more difficult to estimate the 
model with omitted variables included than to estimate the restricted model with omit- 
ted variables excluded. Furthermore, this test can use robust sandwich standard errors, 
though this really only makes sense if the estimator retains consistency in situations 
where robust sandwich errors are necessary. 

If attention is restricted to ML estimation an alternative is to estimate models with 
and without the potentially irrelevant regressors and perform an LR test. 

Robust forms of the LM test can be easily computed in some settings. For example,
consider a test of $H_0: \beta_2 = 0$ in the Poisson model with mean $\exp(x_1'\beta_1 + x_2'\beta_2)$. The
LM test statistic is based on the score statistic $\sum_i x_i\tilde{u}_i$, where $\tilde{u}_i = y_i - \exp(x_{1i}'\tilde{\beta}_1)$
(see Section 7.3.2). Now a heteroskedastic robust estimate for the variance of
$N^{-1/2}\sum_i x_i u_i$, where $u_i = y_i - E[y_i|x_i]$, is $N^{-1}\sum_i \tilde{u}_i^2 x_i x_i'$, and it can be shown that

$$LM^{+} = \left[\sum_{i=1}^{N} x_i\tilde{u}_i\right]' \left[\sum_{i=1}^{N} \tilde{u}_i^2 x_i x_i'\right]^{-1} \left[\sum_{i=1}^{N} x_i\tilde{u}_i\right]$$

is a robust LM test statistic that does not require the Poisson restriction that
$V[u_i|x_i] = \exp(x_{1i}'\beta_1)$ under $H_0$. This can be computed as N times the uncentered $R^2$ from
regression of 1 on $x_{1i}\tilde{u}_i$ and $x_{2i}\tilde{u}_i$. Such robust LM tests are possible more generally for
assumed models in the linear exponential family, as the score statistic in such models is
again a weighted average of a residual $\tilde{u}_i$ (see Wooldridge, 1991). This class includes
OLS, and adaptations are also possible when estimation is by 2SLS or by NLS; see
Wooldridge (2002).


8.4.2. Tests for Heteroskedasticity


Parameter estimates in linear or nonlinear regression models of the conditional mean 
estimated by LS or IV methods retain their consistency in the presence of het- 
eroskedasticity. The only correction needed is to the standard errors of these estimates. 
This does not require modeling heteroskedasticity, as heteroskedastic-robust standard 
errors can be computed under minimal distributional assumptions using the result of 
White (1980). So there is little need to test for heteroskedasticity, unless estimator 
efficiency is of great concern. Nonetheless, we summarize some results on tests for 
heteroskedasticity. 

We begin with LS estimation of the linear regression model $y = x'\beta + u$. Suppose
heteroskedasticity is modeled by $V[u|x] = g(\alpha_1 + z'\alpha_2)$, where z is usually a subset
of x and $g(\cdot)$ is often the exponential function. The literature focuses on tests of
$H_0: \alpha_2 = 0$ using the LM approach because, unlike Wald and LR tests, these require
only OLS estimation of $\beta$. The standard LM test of Breusch and Pagan (1979) depends
heavily on the assumption of normally distributed errors, as it uses the restriction that
$E[u^4|x] = 3\sigma^4$ under $H_0$. Koenker (1981) proposed a more robust version of the LM
test, $NR^2$ from regression of $\hat{u}_i^2$ on 1 and $z_i$, where $\hat{u}_i$ is the OLS residual. This test
requires the weaker assumption that $E[u^4|x]$ is constant. Like the Breusch-Pagan test it
is invariant to choice of the function $g(\cdot)$. The White (1980a) test for heteroskedasticity
is equivalent to this LM test, with $z = \mathrm{Vech}[xx']$. The test can be further generalized
to let $E[u^4|x]$ vary with x, though constancy may be a reasonable assumption for the
test since $H_0$ already specifies that $E[u^2|x]$ is constant.
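
The $NR^2$ form of the test can be sketched as follows for the simplest case of one regressor and scalar z. The dgp, seed, and function name `koenker_test` are hypothetical and chosen only so that the example runs without external libraries; with a single z the centered $R^2$ of the auxiliary regression is the squared correlation between the squared residuals and z.

```python
import random


def koenker_test(y, x, z):
    """N * R^2 from regressing squared OLS residuals on a constant and scalar z."""
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    slope = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
             / sum((a - xbar) ** 2 for a in x))
    u2 = [(b - ybar - slope * (a - xbar)) ** 2 for a, b in zip(x, y)]
    ubar, zbar = sum(u2) / n, sum(z) / n
    szz = sum((c - zbar) ** 2 for c in z)
    suu = sum((d - ubar) ** 2 for d in u2)
    szu = sum((c - zbar) * (d - ubar) for c, d in zip(z, u2))
    # With one regressor plus a constant, the centered R^2 is the squared correlation.
    return n * szu ** 2 / (szz * suu)


rng = random.Random(2024)
n = 500
x = [rng.random() for _ in range(n)]
# Hypothetical dgp: the error standard deviation rises with x.
y = [1.0 + 2.0 * a + (0.5 + 2.0 * a) * rng.gauss(0.0, 1.0) for a in x]
stat = koenker_test(y, x, x)   # here z = x
print(round(stat, 2))          # compare with the chi-square(1) critical value 3.84
```

For vector z the same statistic would come from a multiple regression of the squared residuals on a constant and z, with degrees of freedom equal to dim(z).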

Qualitatively similar results carry over to nonlinear models of the conditional mean 
that assume a particular form of heteroskedasticity that may be tested for misspecification.
For example, the Poisson regression model sets $V[y|x] = \exp(x'\beta)$. More
generally, for models in the linear exponential family, the quasi-MLE is consistent 
despite misspecified heteroskedasticity and qualitatively similar results to those here 
apply. Then valid inference is possible even if the model for heteroskedasticity is mis- 
specified, provided the robust standard errors presented in Section 5.7.4 are used. If 
one still wishes to test for correct specification of heteroskedasticity then robust LM 
tests are possible (see Wooldridge, 1991). 

Heteroskedasticity can lead to the more serious consequence of inconsistency of pa- 
rameter estimates in some nonlinear models. A leading example is the Tobit model (see 
Chapter 16), a linear regression model with normal homoskedastic errors that becomes 
nonlinear as the result of censoring or truncation. Then testing for heteroskedasticity 
becomes more important. A model for V[u|x] can be specified and Wald, LR, or LM 
tests can be performed or m-tests for heteroskedasticity can be used (see Pagan and 
Vella, 1989). 


8.4.3. Hausman Tests for Endogeneity 


Instrumental variables estimators should only be used where there is a need for them, 
since LS estimators are more efficient if all regressors are exogenous and from Section 4.9
this loss of efficiency can be substantial. It can therefore be useful to test
whether IV methods are needed. A test for endogeneity of regressors compares IV
estimates with LS estimates. If regressors are endogenous then in the limit these esti- 
mates will differ, whereas if regressors are exogenous the two estimators will not differ. 
Thus large differences between LS and IV estimates can be interpreted as evidence of 
endogeneity. 

This example provides the original motivation for the Hausman test. Consider the 
linear regression model 


$$y = x_1'\beta_1 + x_2'\beta_2 + u, \tag{8.38}$$

where $x_1$ is potentially endogenous and $x_2$ is exogenous. Let $\hat{\beta}$ be the OLS estimator
and $\tilde{\beta}$ be the 2SLS estimator in (8.38). Assuming homoskedastic errors so that OLS is
efficient under the null hypothesis of no endogeneity, a Hausman test of endogeneity of
$x_1$ can be calculated using the test statistic H defined in (8.37). Because $V[\tilde{\beta}] - V[\hat{\beta}]$
can be shown to be not of full rank, however, a generalized inverse is needed and the
degrees of freedom are $\dim(\beta_1)$ rather than $\dim(\beta)$.

Hausman (1978) showed that the test can more simply be implemented by test of
$\gamma = 0$ in the augmented OLS regression

$$y = x_1'\beta_1 + x_2'\beta_2 + \hat{x}_1'\gamma + u,$$

where $\hat{x}_1$ is the predicted value of the endogenous regressors $x_1$ from reduced form
multivariate regression of $x_1$ on the instruments z. Equivalently, we can test $\gamma = 0$ in
the augmented OLS regression

$$y = x_1'\beta_1 + x_2'\beta_2 + \hat{v}_1'\gamma + u,$$

where $\hat{v}_1$ is the residual from the reduced form multivariate regression of $x_1$ on the
instruments z. Intuition for these tests is that if u in (8.38) is uncorrelated with $x_1$
and $x_2$, then $\gamma = 0$. If instead u is correlated with $x_1$, then this will be picked up by
significance of additional transformations of $x_1$ such as $\hat{x}_1$ and $\hat{v}_1$.
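
The residual-augmented regression can be sketched in Python for the simplest case of one potentially endogenous regressor, one instrument, and a constant. The dgp, seed, and helper names (`solve`, `ols`) are hypothetical; the standard errors are the conventional homoskedastic ones, so this corresponds to the nonrobust version of the test.

```python
import math
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def ols(rows, y):
    """OLS coefficients, conventional (homoskedastic) standard errors, residuals."""
    n, k = len(rows), len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    resid = [t - sum(c * v for c, v in zip(beta, r)) for r, t in zip(rows, y)]
    s2 = sum(e * e for e in resid) / (n - k)
    se = []
    for i in range(k):
        unit = [1.0 if j == i else 0.0 for j in range(k)]
        se.append(math.sqrt(s2 * solve(xtx, unit)[i]))
    return beta, se, resid


rng = random.Random(7)
n = 1000
z = [rng.gauss(0, 1) for _ in range(n)]
e = [rng.gauss(0, 1) for _ in range(n)]
# Hypothetical dgp: the reduced-form error v shares the component e with u.
v = [0.8 * ei + 0.6 * rng.gauss(0, 1) for ei in e]
x1 = [zi + vi for zi, vi in zip(z, v)]
y = [1.0 + 0.5 * xi + ei for xi, ei in zip(x1, e)]

# Step 1: reduced-form regression of x1 on the instrument; keep the residuals.
_, _, v_hat = ols([[1.0, zi] for zi in z], x1)
# Step 2: augmented regression of y on x1 and the first-stage residual.
beta, se, _ = ols([[1.0, xi, vi] for xi, vi in zip(x1, v_hat)], y)
t_gamma = beta[2] / se[2]
print(round(beta[1], 3), round(t_gamma, 2))
```

In this design the regressor is endogenous by construction, so the t-statistic on the residual term should be large. For inference robust to heteroskedasticity, the same regression would be run with a heteroskedasticity-consistent variance estimate in place of the conventional one.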

For cross-section data it is customary to presume heteroskedastic errors. Then the 
OLS estimator $\hat{\beta}$ is inefficient in (8.38) and the simpler version (8.37) of the Hausman
test cannot be used. However, the preceding augmented OLS regressions can
still be used, provided $\gamma = 0$ is tested using the heteroskedastic-consistent estimate of
the variance matrix. This should actually be equivalent to the Hausman test, as from
Davidson and MacKinnon (1993, p. 239) $\hat{\gamma}_{OLS}$ in these augmented regressions equals
$A_N(\tilde{\beta} - \hat{\beta})$, where $A_N$ is a full-rank matrix with finite probability limit.

Additional Hausman tests for endogeneity are possible. Suppose $y = x_1'\beta_1 +
x_2'\beta_2 + x_3'\beta_3 + u$, where $x_1$ is potentially endogenous, $x_2$ is assumed to be
endogenous, and $x_3$ is assumed to be exogenous. Then endogeneity of $x_1$ can be tested
by comparing the 2SLS estimator with just $x_2$ instrumented to the 2SLS estimator
with both $x_1$ and $x_2$ instrumented. The Hausman test can also be generalized
to nonlinear regression models, with OLS replaced by NLS and 2SLS replaced
by NL2SLS. Davidson and MacKinnon (1993) present augmented regressions that
can be used to compute the relevant Hausman test, assuming homoskedastic errors.
Mroz (1987) provides a good application of endogeneity tests including examples of
computation of $V[\tilde{\theta} - \hat{\theta}]$ when $\hat{\theta}$ is not efficient.


8.4.4. OIR Tests for Exogeneity


If an IV estimator is used then the instruments must be exogenous for the IV estimator 
to be consistent. For just-identified models it is not possible to test for instrument 
exogeneity. Instead, a priori arguments need to be used to justify instrument validity. 
Some examples are given in Section 4.8.2. For overidentified models, however, a test 
for exogeneity of instruments is possible. 

We begin with linear regression. Then $y = x'\beta + u$ and instruments z are valid
if $E[u|z] = 0$ or if $E[zu] = 0$. An obvious test of $H_0: E[zu] = 0$ is based on
departures of $N^{-1}\sum_i z_i\hat{u}_i$ from zero. In the just-identified case the IV estimator solves
$N^{-1}\sum_i z_i\hat{u}_i = 0$ so this test is not useful. In the overidentified case the
overidentifying restrictions test presented in Section 6.3.8 is

$$OIR = \hat{u}'Z\,\hat{S}^{-1}Z'\hat{u}, \tag{8.39}$$

where $\hat{u} = y - X\hat{\beta}$, $\hat{\beta}$ is the optimal GMM estimator that minimizes $u'Z\hat{S}^{-1}Z'u$, and
$\hat{S}$ is consistent for $\mathrm{plim}\,N^{-1}\sum_i u_i^2 z_i z_i'$. The OIR test of Hansen (1982) is an extension
of a test proposed by Sargan (1958) for linear IV, and the test statistic (8.39) is often
called a Sargan test. If OIR is large then the moment conditions are rejected and the
IV estimator is inconsistent. Rejection of $H_0$ is usually interpreted as evidence that the
instruments z are endogenous, but it could also be evidence of model misspecification
so that in fact $y \neq x'\beta + u$. In either case rejection indicates problems for the IV
estimator.

As formally derived in Section 6.3.9, OIR is distributed as $\chi^2(r - K)$ under $H_0$,
where $(r - K)$ is the number of overidentifying restrictions. To gain some intuition for
this result it is useful to specialize to homoskedastic errors. Then
$\hat{S} = \hat{\sigma}^2 Z'Z$, where $\hat{\sigma}^2 = \hat{u}'\hat{u}/(N - K)$, so


$$\mathrm{OIR} = \frac{\hat{u}'P_Z\hat{u}}{\hat{u}'\hat{u}/(N - K)},$$


where $P_Z = Z(Z'Z)^{-1}Z'$. Thus OIR is a ratio of quadratic forms in $\hat{u}$. Under
$H_0$ the numerator has probability limit $\sigma^2(r - K)$ and the denominator has
plim $\hat{\sigma}^2 = \sigma^2$, so the ratio is centered on $r - K$, but this is the mean
of a $\chi^2(r - K)$ random variable.
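
To make the homoskedastic form of the statistic concrete, the following Python sketch (not part of the original text) computes $\hat{u}'P_Z\hat{u}/[\hat{u}'\hat{u}/(N-K)]$ for the simplest overidentified case: one endogenous regressor, no intercept, and two instruments, so $K = 1$ and $r = 2$. The function names and the simulated design are illustrative assumptions, and 2SLS is used as the optimal GMM estimator, which is appropriate only under homoskedasticity.

```python
import random


def pz_quadratic(u, v, z1, z2):
    """Return u' P_Z v, with P_Z = Z (Z'Z)^{-1} Z' and Z = [z1, z2]."""
    a = sum(p * p for p in z1)               # z1'z1
    b = sum(p * q for p, q in zip(z1, z2))   # z1'z2
    d = sum(q * q for q in z2)               # z2'z2
    det = a * d - b * b                      # determinant of Z'Z
    u1 = sum(p * w for p, w in zip(z1, u))   # z1'u
    u2 = sum(q * w for q, w in zip(z2, u))   # z2'u
    v1 = sum(p * w for p, w in zip(z1, v))   # z1'v
    v2 = sum(q * w for q, w in zip(z2, v))   # z2'v
    return (u1 * (d * v1 - b * v2) + u2 * (a * v2 - b * v1)) / det


def oir_statistic(y, x, z1, z2):
    """Homoskedastic OIR statistic for y = x*beta + u with K = 1 and r = 2."""
    n = len(y)
    beta = pz_quadratic(x, y, z1, z2) / pz_quadratic(x, x, z1, z2)  # 2SLS
    resid = [yi - beta * xi for yi, xi in zip(y, x)]
    sigma2 = sum(e * e for e in resid) / (n - 1)   # u'u / (N - K), K = 1
    return pz_quadratic(resid, resid, z1, z2) / sigma2


def simulate_oir(instrument_invalid, n=2000, seed=1):
    """Hypothetical design: one endogenous regressor, two instruments."""
    rng = random.Random(seed)
    y, x, z1, z2 = [], [], [], []
    for _ in range(n):
        a, b, e, v = (rng.gauss(0, 1) for _ in range(4))
        # An invalid second instrument enters the error term directly.
        u = e + (0.5 * b if instrument_invalid else 0.0)
        xi = a + b + 0.5 * e + v      # regressor correlated with the error
        z1.append(a)
        z2.append(b)
        x.append(xi)
        y.append(1.0 * xi + u)
    return oir_statistic(y, x, z1, z2)


oir_valid = simulate_oir(False)     # compare with chi-square(1) critical values
oir_invalid = simulate_oir(True)    # expected to be far in the upper tail
```

With valid instruments the statistic should behave like a $\chi^2(1)$ draw, whereas with the invalid second instrument it grows with $N$. The sketch deliberately avoids any matrix library, so it does not generalize beyond two instruments without rewriting `pz_quadratic`.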

The test statistic in (8.39) extends immediately to nonlinear regression, by simply
defining $\hat{u}_i = y_i - g(x_i, \hat{\beta})$ or $\hat{u}_i = r(y_i, x_i, \hat{\beta})$ as in
Section 6.5, and to linear systems and panel estimators by appropriate definition of
$\hat{u}$ (see Sections 6.9 and 6.10).

For linear IV with homoskedastic errors alternative OIR tests to (8.39) have been
proposed. Magdalinos (1988) contrasts a number of these tests. One can also use
incremental OIR tests of a subset of overidentifying restrictions.


8.4.5. RESET Test 


A common functional form misspecification may involve neglected nonlinearity in
some of the regressors. Consider the regression $y = x'\beta + u$, where we assume that the
regressors enter linearly and are asymptotically uncorrelated with the error $u$. To test
for nonlinearity one straightforward approach is to enter power functions of exogenous
variables, most commonly squares, as additional independent regressors and test the
statistical significance of these additional variables using a Wald test or an F-test.
This requires the investigator to have specific reasons for considering nonlinearity, and
clearly the technique will not work for categorical $x$ variables.

Ramsey (1969) suggested a test of omitted variables from the regression that can
be formulated as a test of functional form. The proposal is to fit the initial regression
and generate new regressors that are functions of fitted values $\hat{y} = x'\hat{\beta}$, such
as $w = [(x'\hat{\beta})^2, (x'\hat{\beta})^3, \ldots, (x'\hat{\beta})^p]$. Then estimate the model
$y = x'\beta + w'\gamma + u$, and the test of nonlinearity is the Wald test of
$H_0 : \gamma = 0$ against $H_a : \gamma \neq 0$, with one restriction for each added regressor.
Typically a low value of $p$ such as 2 or 3 is used. This test can be made
robust to heteroskedasticity.
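
As an illustration only, the Python sketch below implements the simplest RESET variant, with a single regressor plus intercept and only the squared fitted value added ($p = 2$). It is not taken from the original text. It assumes homoskedastic errors (so it is not the robust version mentioned above) and obtains the t-statistic on the added regressor by partialling out the intercept and $x$, which gives the same coefficient and residuals as the full regression. All names and the simulated data are hypothetical.

```python
import math
import random


def residualize(v, x):
    """Residuals from OLS of v on an intercept and the single regressor x."""
    n = len(v)
    xbar, vbar = sum(x) / n, sum(v) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (vi - vbar) for xi, vi in zip(x, v)) / sxx
    return [vi - vbar - slope * (xi - xbar) for xi, vi in zip(x, v)]


def reset_t(y, x):
    """t-statistic on the squared fitted value (RESET with p = 2)."""
    n = len(y)
    e = residualize(y, x)                         # residuals of the initial fit
    fitted = [yi - ei for yi, ei in zip(y, e)]    # fitted values x'beta_hat
    w = residualize([f * f for f in fitted], x)   # partial out (1, x)
    sww = sum(wi * wi for wi in w)
    gamma = sum(wi * ei for wi, ei in zip(w, e)) / sww
    rss = sum((ei - gamma * wi) ** 2 for wi, ei in zip(w, e))
    s2 = rss / (n - 3)                            # three estimated coefficients
    return gamma / math.sqrt(s2 / sww)


def simulate_reset(curvature, n=500, seed=2):
    """Hypothetical dgp: y = 1 + x + curvature * x^2 + noise."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [1 + xi + curvature * xi * xi + rng.gauss(0, 1) for xi in x]
    return reset_t(y, x)


t_linear = simulate_reset(0.0)   # correctly specified linear model
t_curved = simulate_reset(0.5)   # neglected quadratic term
```

With one added regressor the Wald statistic is the square of this t-statistic. For larger $p$ a joint test of all added powers would be needed, which this sketch does not attempt.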


8.5. Discriminating between Nonnested Models 


Two models are nested if one is a special case of the other; they are nonnested if 
neither can be represented as a special case of the other. Discriminating between nested 
models is possible using a standard hypothesis test of the parametric restrictions that 
reduce one model to the other. In the nonnested case, however, alternative methods 
need to be developed. 

The presentation focuses on nonnested model discrimination within the likelihood 
framework, where results are well developed. A brief discussion of the nonlikelihood 
case is given in Section 8.5.4. Bayesian methods for model discrimination are pre- 
sented in Section 13.8. 


8.5.1. Information Criteria 


Information criteria are log-likelihood criteria with degrees of freedom adjustment. 
The model with the smallest information criterion is preferred. 

The essential intuition is that there exists a tension between model fit, as measured 
by the maximized log-likelihood value, and the principle of parsimony that favors a 
simple model. The fit of the model can be improved by increasing model complexity. 
However, parameters are only added if the resulting improvement in fit sufficiently 
compensates for loss of parsimony. Note that in this viewpoint it is not necessary 
that the set of models under consideration should include the “true dgp.” Different 
information criteria vary in how steeply they penalize model complexity. 

Akaike (1973) originally proposed the Akaike information criterion 


$$\mathrm{AIC} = -2\ln L + 2q, \tag{8.40}$$


where $q$ is the number of parameters, with the model with lowest AIC preferred. The
term information criterion is used because the underlying theory, presented more simply
in Amemiya (1980), discriminates among models using the Kullback-Leibler
information criterion (KLIC).

A considerable number of modifications to AIC have been proposed, all of the form
$-2\ln L + g(q, N)$ for specified penalty function $g(\cdot)$ that exceeds $2q$. The most
popular variation is the Bayesian information criterion


$$\mathrm{BIC} = -2\ln L + (\ln N)q, \tag{8.41}$$


proposed by Schwarz (1978). Schwarz assumed $y$ has density in the exponential family
with parameter $\theta$, the $j$th model has parameter $\theta_j$ with
$\dim[\theta_j] = q_j < \dim[\theta]$, and the prior across models is a weighted sum of the
prior for each $\theta_j$. He showed that under these assumptions maximizing the posterior
probability (see Chapter 13) is asymptotically equivalent to choosing the model for which
$\ln L - (\ln N)q_j/2$ is largest. Since this is equivalent to minimizing (8.41), the
procedure of Schwarz has been labeled the Bayesian information criterion. A refinement
of AIC based on minimization of KLIC that is similar to BIC is the consistent AIC,
$\mathrm{CAIC} = -2\ln L + (1 + \ln N)q$. Some authors define criteria such as AIC and BIC
by additionally dividing by $N$ in the right-hand sides of (8.40) and (8.41).

If model parsimony is important, then BIC is more widely used, as the model-size
penalty for AIC is relatively low. Consider two nested models with $q_1$ and $q_2$
parameters, respectively, where $q_2 = q_1 + h$. An LR test is then possible and favors
the larger model at significance level 5% if $2\ln L$ increases by $\chi^2_{.05}(h)$. AIC
favors the larger model if $2\ln L$ increases by more than $2h$, a lesser penalty for model
size than the LR test if $h < 7$. In particular for $h = 1$, that is, one restriction, the
LR test uses a 5% critical value of 3.84 whereas AIC uses a much lower value of 2. The
BIC favors the larger model if $2\ln L$ increases by $h\ln N$, a much larger penalty than
either AIC or an LR test of size 0.05 (unless $N$ is exceptionally small).

The Bayesian information criterion increases the penalty as sample size increases,
whereas traditional hypothesis tests at a significance level such as 5% do not. For
nested models with $q_2 = q_1 + 1$ choosing the larger model on the basis of lower BIC
is equivalent to using a two-sided t-test critical value of $\sqrt{\ln N}$, which equals
2.15, 3.03, and 3.72, respectively, for $N = 10^2$, $10^4$, and $10^6$. By comparison
traditional hypothesis tests with size 0.05 use an unchanging critical value of 1.96. More
generally, for a $\chi^2(h)$ distributed test statistic the BIC suggests using a critical
value of $h\ln N$ rather than the customary $\chi^2_{.05}(h)$.
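
The implied thresholds are easy to tabulate. The short Python sketch below (an illustration with hypothetical function names, not part of the original text) computes the increase in $2\ln L$ that AIC and BIC require before favoring a model with $h$ extra parameters, together with the two-sided t critical value that BIC implies when $h = 1$.

```python
import math


def aic_threshold(h):
    """Increase in 2 ln L needed for AIC to favor h extra parameters."""
    return 2 * h


def bic_threshold(h, n):
    """Increase in 2 ln L needed for BIC to favor h extra parameters."""
    return h * math.log(n)


def bic_t_critical(n):
    """Two-sided |t| critical value implied by BIC when h = 1."""
    return math.sqrt(math.log(n))


# Implied critical values for N = 10^2, 10^4, 10^6, to compare with 1.96.
t_values = [round(bic_t_critical(n), 2) for n in (10**2, 10**4, 10**6)]
```

Because the BIC threshold grows with $\ln N$, a fixed increase in $2\ln L$ that passes a conventional 5% test can fail the BIC comparison in a large sample.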

Given their simplicity, penalized likelihood criteria are often used for selecting “the 
best model.” However, there is no clear answer as to which criterion, if any, should 
be preferred. Considerable approximation is involved in deriving the formulas for AIC 
and related measures, and loss functions other than minimization of KLIC, or max- 
imization of the posterior probability in the case of BIC, might be much more ap- 
propriate. From a decision-theoretic viewpoint, the choice of the model from a set of 
models should depend on the intended use of that model. For example, the purpose of 
the model may be to summarize the main features of a complex reality, or to predict 
some outcome, or to test some important hypothesis. In applied work it is quite rare to 
see an explicit statement of the intended use of an econometric model. 


8.5.2. Cox Likelihood Ratio Test of Nonnested Models 


Consider choosing between two parametric models. Let model $F_\theta$ have density
$f(y|x, \theta)$ and model $G_\gamma$ have density $g(y|x, \gamma)$.

A likelihood ratio test of the model $F_\theta$ against $G_\gamma$ is based on


$$LR(\hat{\theta}, \hat{\gamma}) \equiv L_f(\hat{\theta}) - L_g(\hat{\gamma}) = \sum_{i=1}^{N} \ln\frac{f(y_i|x_i, \hat{\theta})}{g(y_i|x_i, \hat{\gamma})}. \tag{8.42}$$


If $G_\gamma$ is nested in $F_\theta$ then, from Section 7.3.1, $2LR(\hat{\theta}, \hat{\gamma})$ is
chi-square distributed under the null hypothesis that $F_\theta = G_\gamma$. However, this
result no longer holds if the models are nonnested.

Cox (1961, 1962b) proposed solving this problem in the special case that $F_\theta$ is the
true model but the models are not nested, by applying a central limit theorem under
the assumption that $F_\theta$ is the true model.

This approach is computationally awkward to implement if one cannot analytically
obtain $E_f[\ln(f(y|x, \theta)/g(y|x, \gamma))]$, where $E_f$ denotes expectation with respect
to the density $f(y|x, \theta)$. Furthermore, if a similar test statistic is obtained with
the roles of $F_\theta$ and $G_\gamma$ reversed it is possible to find both that model
$F_\theta$ is rejected in favor of $G_\gamma$ and that model $G_\gamma$ is rejected in favor of
$F_\theta$. The test is therefore not necessarily one of model selection as it does not
necessarily select one or the other; instead it is a model specification test that zero,
one, or two of the models can pass.

The Cox statistic has been obtained analytically in some cases. For nonnested
linear regression models $y = x'\beta + u$ and $y = z'\gamma + v$ with homoskedastic
normally distributed errors, see Pesaran (1974). For nonnested transformation models
$h(y) = x'\beta + u$ and $g(y) = z'\gamma + v$, where $h(y)$ and $g(y)$ are known
transformations, see Pesaran and Pesaran (1995), who use a simulation-based approach. This
permits, for example, discrimination between linear and log-linear parametric models,
with $h(\cdot)$ the identity transformation and $g(\cdot)$ the log transformation. Pesaran and
Pesaran (1995) apply the idea to choosing between logit and probit models presented in
Chapter 14.


8.5.3. Vuong Likelihood Ratio Test of Nonnested Models 


Vuong (1989) provided a very general distribution theory for the LR test statistic that 
covers both nested and nonnested models and more remarkably permits the dgp to be 
an unknown density that differs from both f(-) and g(-). 

The asymptotic results of Vuong, presented here to aid understanding of the variety 
of tests presented in Vuong’s paper, are relatively complex as in some cases the test 
statistic is a weighted sum of chi-squares with weights that can be difficult to compute. 

Vuong proposed a test of


$$H_0 : E_0\left[\ln\frac{f(y|x, \theta)}{g(y|x, \gamma)}\right] = 0, \tag{8.43}$$


where $E_0$ denotes expectation with respect to the true dgp $h(y|x)$, which may be
unknown. This is equivalent to testing $E_h[\ln(h/g)] - E_h[\ln(h/f)] = 0$, or testing whether
the two densities $f$ and $g$ have the same Kullback-Leibler information criterion
(see Section 5.7.2). One-sided alternatives are possible with
$H_f : E_0[\ln(f/g)] > 0$ and $H_g : E_0[\ln(f/g)] < 0$.

An obvious test of $H_0$ is an m-test of whether the sample analogue
$LR(\hat{\theta}, \hat{\gamma})$ defined in (8.42) differs from zero. Here the distribution of the
test statistic is to be obtained with possibly unknown dgp. This is possible because from
Section 5.7.1 the quasi-MLE $\hat{\theta}$ converges to the pseudo-true value $\theta_*$ and
$\sqrt{N}(\hat{\theta} - \theta_*)$ has a limit normal distribution, with a similar result for
the quasi-MLE $\hat{\gamma}$.


General Result 


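The resulting distribution of $LR(\hat{\theta}, \hat{\gamma})$ varies according to whether or not
the two models, both possibly incorrect, are equivalent in the sense that
$f(y|x, \theta_*) = g(y|x, \gamma_*)$, where $\theta_*$ and $\gamma_*$ are the pseudo-true values
of $\theta$ and $\gamma$.

If $f(y|x, \theta_*) = g(y|x, \gamma_*)$ then


$$2LR(\hat{\theta}, \hat{\gamma}) \xrightarrow{d} M_{p+q}(\lambda_*), \tag{8.44}$$


where $p$ and $q$ are the dimensions of $\theta$ and $\gamma$ and $M_{p+q}(\lambda_*)$ denotes the
cdf of the weighted sum of chi-squared variables $\sum_{j=1}^{p+q}\lambda_{*j}Z_j^2$. The
$Z_j^2$ are iid $\chi^2(1)$ and $\lambda_{*j}$ are the eigenvalues of the
$(p + q) \times (p + q)$ matrix


$$W = \begin{bmatrix} -B_f(\theta_*)A_f(\theta_*)^{-1} & -B_{fg}(\theta_*, \gamma_*)A_g(\gamma_*)^{-1} \\ B_{gf}(\gamma_*, \theta_*)A_f(\theta_*)^{-1} & B_g(\gamma_*)A_g(\gamma_*)^{-1} \end{bmatrix}, \tag{8.45}$$


where $A_f(\theta_*) = E_0[\partial^2 \ln f/\partial\theta\partial\theta']$,
$B_f(\theta_*) = E_0[(\partial \ln f/\partial\theta)(\partial \ln f/\partial\theta')]$, the matrices
$A_g(\gamma_*)$ and $B_g(\gamma_*)$ are similarly defined for the density $g(\cdot)$, the
cross-matrix $B_{fg}(\theta_*, \gamma_*) = E_0[(\partial \ln f/\partial\theta)(\partial \ln g/\partial\gamma')]$,
and expectations are with respect to the true dgp. For explanation and derivation of these
results see Vuong (1989).

If instead $f(y|x, \theta_*) \neq g(y|x, \gamma_*)$, then under $H_0$


$$N^{-1/2}LR(\hat{\theta}, \hat{\gamma}) \xrightarrow{d} N[0, \omega_*^2], \tag{8.46}$$

where

$$\omega_*^2 = V_0\left[\ln\frac{f(y|x, \theta_*)}{g(y|x, \gamma_*)}\right], \tag{8.47}$$


and the variance is with respect to the true dgp. For derivation again see Vuong (1989).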

Use of these results varies with whether or not one model is assumed to be correctly 
specified and with the nesting relationship between the two models. 

Vuong differentiated among three types of model comparisons. The models $F_\theta$ and
$G_\gamma$ are (1) nested with $G_\gamma$ nested in $F_\theta$ if $G_\gamma \subset F_\theta$;
(2) strictly nonnested models if and only if $F_\theta \cap G_\gamma = \emptyset$ so that neither
model can specialize to the other; and (3) overlapping if
$F_\theta \cap G_\gamma \neq \emptyset$ and $F_\theta \not\subset G_\gamma$ and
$G_\gamma \not\subset F_\theta$. Similar distinctions are made by Pesaran and Pesaran (1995).

Both (2) and (3) are nonnested models, but they require different testing procedures. 
Examples of strictly nonnested models are linear models with different error distribu- 
tions and nonlinear regression models with the same error distributions but different 
functional forms for the conditional mean. For overlapping models some specializa- 
tions of the two models are equal. An example is linear models with some regressors 
in common and some regressors not in common. 


Nested Models


For nested models it is necessarily the case that $f(y|x, \theta_*) = g(y|x, \gamma_*)$. For
$G_\gamma$ nested in $F_\theta$, $H_0$ is tested against $H_f : E_0[\ln(f/g)] > 0$.

For density possibly misspecified the weighted chi-square result (8.44) is appropriate,
using the eigenvalues $\hat{\lambda}_j$ of the sample analogue of $W$ in (8.45). Alternatively,
one can use eigenvalues $\hat{\underline{\lambda}}_j$ of the sample analogue of the smaller matrix


$$\underline{W} = B_f(\theta_*)\left[D(\gamma_*)A_g(\gamma_*)^{-1}D(\gamma_*)' - A_f(\theta_*)^{-1}\right],$$


where $D(\gamma_*) = \partial\phi(\gamma_*)/\partial\gamma'$ and the constrained quasi-MLE
$\tilde{\theta} = \phi(\hat{\gamma})$; see Vuong (1989). This result provides a robustified version
of the standard LR test for nested models.

If the density $f(\cdot)$ is actually correctly specified, or more generally satisfies the IM
equality, we get the expected result that
$2LR(\hat{\theta}, \hat{\gamma}) \xrightarrow{d} \chi^2(p - q)$ as then $(p - q)$ of
the eigenvalues of $W$ or $\underline{W}$ equal one whereas the others equal zero.


Strictly Nonnested Models 


For strictly nonnested models it is necessarily the case that
$f(y|x, \theta_*) \neq g(y|x, \gamma_*)$.
The normal distribution result (8.46) is applicable, and a consistent estimate of
$\omega_*^2$ is


$$\hat{\omega}^2 = \frac{1}{N}\sum_{i=1}^{N}\left(\ln\frac{f(y_i|x_i, \hat{\theta})}{g(y_i|x_i, \hat{\gamma})}\right)^2 - \left(\frac{1}{N}\sum_{i=1}^{N}\ln\frac{f(y_i|x_i, \hat{\theta})}{g(y_i|x_i, \hat{\gamma})}\right)^2. \tag{8.48}$$


Thus form

$$T_{LR} = N^{-1/2}LR(\hat{\theta}, \hat{\gamma})/\hat{\omega} \xrightarrow{d} N[0, 1]. \tag{8.49}$$


For tests with critical value $c$, $H_0$ is rejected in favor of
$H_f : E_0[\ln(f/g)] > 0$ if $T_{LR} > c$, $H_0$ is rejected in favor of
$H_g : E_0[\ln(f/g)] < 0$ if $T_{LR} < -c$, and discrimination between the two models is
not possible if $|T_{LR}| < c$. The test can be modified
to permit log-likelihood penalties similar to AIC and BIC; see Vuong (1989, p. 316).
An asymptotically equivalent statistic to (8.49) replaces $\hat{\omega}^2$ by
$\tilde{\omega}^2$ equal to just the first term in the right-hand side of (8.48).
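
Once the two models have been fitted, the statistic needs only the observation-level log-densities. The following Python sketch (an illustration with hypothetical names, not part of the original text) takes two lists of fitted log-densities, $\ln f(y_i|x_i,\hat{\theta})$ and $\ln g(y_i|x_i,\hat{\gamma})$, and returns $LR$, $\hat{\omega}^2$, and $T_{LR}$, followed by the three-way decision rule with a critical value that defaults to 1.96.

```python
import math


def vuong_statistic(log_f, log_g):
    """Return (LR, omega_hat^2, T_LR) from pointwise fitted log-densities."""
    n = len(log_f)
    m = [lf - lg for lf, lg in zip(log_f, log_g)]   # ln(f_i / g_i)
    lr = sum(m)
    # Mean of squares minus squared mean of the log-density ratios.
    omega2 = sum(mi * mi for mi in m) / n - (lr / n) ** 2
    t_lr = lr / (math.sqrt(n) * math.sqrt(omega2))
    return lr, omega2, t_lr


def vuong_decision(t_lr, c=1.96):
    """Decision rule for strictly nonnested models at critical value c."""
    if t_lr > c:
        return "favor f"
    if t_lr < -c:
        return "favor g"
    return "no discrimination"


# Tiny artificial example: five log-density differences.
lr, omega2, t_lr = vuong_statistic([1.0, -1.0, 1.0, -1.0, 2.0], [0.0] * 5)
decision = vuong_decision(t_lr)
```

The artificial inputs are chosen only so that the arithmetic can be checked by hand; in practice the two lists would come from the fitted quasi-ML densities of the competing models.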

This test assumes that both models are misspecified. If instead one of the models is 
assumed to be correctly specified, the Cox test approach of Section 8.5.2 needs to be 
used. 


Overlapping Models 


For overlapping models it is not clear a priori as to whether or not
$f(y|x, \theta_*) = g(y|x, \gamma_*)$, and one needs to first test this condition.

Vuong (1989) proposes testing whether or not the variance $\omega_*^2$ defined in (8.47)
equals zero, since $\omega_*^2 = 0$ if and only if $f(\cdot) = g(\cdot)$. Thus compute
$\hat{\omega}^2$ in (8.48). Under $H_0^\omega : \omega_*^2 = 0$


$$N\hat{\omega}^2 \xrightarrow{d} M_{p+q}(\lambda_*^2), \tag{8.50}$$

where the $M_{p+q}(\lambda_*^2)$ distribution is defined after (8.44). Hypothesis $H_0^\omega$ is
rejected at level $\alpha$ if $N\hat{\omega}^2$ exceeds the upper $\alpha$ percentile of the
$M_{p+q}(\hat{\lambda}^2)$ distribution, using the eigenvalues $\hat{\lambda}_j$ of the sample
analogue of $W$ in (8.45). Alternatively, and more simply, one can test the conditions that
$\theta_*$ and $\gamma_*$ must satisfy for $f(\cdot) = g(\cdot)$. Examples are
given in Lien and Vuong (1987).

If $H_0^\omega$ is not rejected, or the conditions for $f(\cdot) = g(\cdot)$ are not rejected,
conclude that it is not possible to discriminate between the two models given the data. If
$H_0^\omega$ is rejected, or the conditions for $f(\cdot) = g(\cdot)$ are rejected, then test
$H_0$ against $H_f$ or $H_g$ using $T_{LR}$ as detailed in the strictly nonnested case. In this
latter case the significance level is at most the maximum of the significance levels for
each of the two tests.

This test assumes that both models are misspecified. If instead one of the models is
assumed to be correctly specified, then the other model must also be correctly specified
for the two models to be equivalent. Thus $f(y|x, \theta_*) = g(y|x, \gamma_*)$ under $H_0$,
and one can directly move to the LR test using the weighted chi-square result (8.44). Let
$c_1$ and $c_2$ be upper tail and lower tail critical values, respectively. If
$2LR(\hat{\theta}, \hat{\gamma}) > c_1$ then $H_0$ is rejected in favor of $H_f$; if
$2LR(\hat{\theta}, \hat{\gamma}) < c_2$ then $H_0$ is rejected in favor of $H_g$; and
the test is otherwise inconclusive.


8.5.4. Other Nonnested Model Comparisons 


The preceding methods are restricted to fully parametric models. Methods for discrim- 
inating between models that are only partially parameterized, such as linear regression 
without the assumption of normality, are less clear-cut. 

The information criteria of Section 8.5.1 can be replaced by criteria developed using 
loss functions other than KLIC. A variety of measures corresponding to different loss 
functions are presented in Amemiya (1980). These measures are often motivated for 
nested models but may also be applicable to nonnested models. 

A simple approach is to compare predictive ability, selecting the model with lowest
value of mean-squared error $(N - q)^{-1}\sum_i (y_i - \hat{y}_i)^2$. For linear regression
this is equivalent to choosing the model with highest adjusted $R^2$, which is generally
viewed as providing too small a penalty for model complexity. An adaptation for
nonparametric regression is leave-one-out cross-validation (see Section 9.5.3).

Formal tests to discriminate between nonnested models in the nonlikelihood case 
often take one of two approaches. Artificial nesting, proposed by Davidson and 
MacKinnon (1984), embeds the two nonnested models into a more general artificial 
model and leads to so-called J tests and P tests and related tests. The encompassing 
principle, proposed by Mizon and Richard (1986), leads to a quite general framework 
for testing one model against a competing nonnested model. White (1994) links this 
approach with CM tests. For a summary of this literature see Davidson and MacKinnon 
(1993, chapter 11). 


8.5.5. Nonnested Models Example 


A sample of 100 observations is generated from a Poisson model with mean
$E[y|x] = \exp(\beta_1 + \beta_2 x_2 + \beta_3 x_3)$, where $x_2, x_3 \sim N[0, 1]$, and
$(\beta_1, \beta_2, \beta_3) = (0.5, 0.5, 0.5)$.

Table 8.2. Nonnested Model Comparisons for Poisson Regression Example^a


Test Type                               Model 1    Model 2    Conclusion
$-2\ln L$                               366.86     352.18     Model 2 preferred
AIC                                     370.86     358.18     Model 2 preferred
BIC                                     376.07     366.00     Model 2 preferred
$N\hat{\omega}^2$                       7.84 with p = 0.000   Can discriminate
$T_{LR} = N^{-1/2}LR/\hat{\omega}$      -0.883 with p = 0.377 No model favored


^a N = 100. Model 1 is Poisson regression of $y$ on intercept and $x_2$. Model 2 is Poisson
regression of $y$ on intercept, $x_3$, and $x_3^2$. The final two rows are for the Vuong test
for overlapping models (see the text).


The dependent variable $y$ has sample mean 1.92 and standard deviation 1.84. Two
incorrect nonnested models were estimated by Poisson regression:


Model 1: $\hat{E}[y|x] = \exp(0.608 + 0.291x_2)$,


Model 2: $\hat{E}[y|x] = \exp(0.493 + 0.359x_3 + 0.091x_3^2)$,
                               (5.14)   (5.10)      (1.78)


where t-statistics are given in parentheses.

The first three rows of Table 8.2 give various information criteria, with the model 
with smallest value preferred. The first does not penalize number of parameters and 
favors model 2. The second and third measures defined in (8.40) and (8.41) give larger 
penalty to model 2, which has an additional parameter, but still lead to the larger model 
2 being favored. 
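
The AIC and BIC rows of Table 8.2 follow directly from the reported values of $-2\ln L$, the parameter counts (two for model 1, three for model 2), and $N = 100$. The Python sketch below (hypothetical function names, not part of the original text) applies (8.40) and (8.41), plus the CAIC variant defined in Section 8.5.1.

```python
import math


def aic(minus2lnl, q):
    """AIC = -2 ln L + 2q, as in (8.40)."""
    return minus2lnl + 2 * q


def bic(minus2lnl, q, n):
    """BIC = -2 ln L + (ln N) q, as in (8.41)."""
    return minus2lnl + math.log(n) * q


def caic(minus2lnl, q, n):
    """Consistent AIC = -2 ln L + (1 + ln N) q."""
    return minus2lnl + (1 + math.log(n)) * q


N = 100
models = {"Model 1": (366.86, 2), "Model 2": (352.18, 3)}   # (-2 ln L, q)
criteria = {
    name: (round(aic(m2l, q), 2), round(bic(m2l, q, N), 2))
    for name, (m2l, q) in models.items()
}
preferred_by_bic = min(models, key=lambda name: bic(*models[name], N))
```

Rounded to two decimals, the computed pairs agree with the AIC and BIC rows of the table, and the smaller value selects model 2 under either criterion.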

The final two rows of Table 8.2 summarize Vuong's test, here a test of overlapping
models.

First, test the condition of equality of the densities when evaluated at the pseudo-true
values. The statistic $\hat{\omega}^2$ in (8.48) is easily computed given expressions for the
densities. The difficult part is computing an estimate of the matrix $W$ in (8.45). For
the Poisson density we can use $\hat{A}$ and $\hat{B}$ defined at the end of Section 5.2.3 and
$\hat{B}_{fg} = N^{-1}\sum_i (y_i - \hat{\mu}_{fi})x_{fi} \times (y_i - \hat{\mu}_{gi})x_{gi}'$. The
eigenvalues of $\hat{W}$ are $\hat{\lambda}_1 = 0.29$, $\hat{\lambda}_2 = 1.00$,
$\hat{\lambda}_3 = 1.06$, $\hat{\lambda}_4 = 1.48$, and $\hat{\lambda}_5 = 2.75$. The p-value for the
test statistic $N\hat{\omega}^2$ with distribution given in (8.44) is obtained as the
proportion of draws of the weighted chi-square sum, say 10,000 draws, which exceed
$N\hat{\omega}^2 = 69.14$. Here $p = 0.000 < 0.05$
and we conclude that it is possible to discriminate between the models. The critical
value at level 0.05 in this example equals 16.10, quite a bit higher than
$\chi^2_{.05}(5) = 11.07$.
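
The simulation step can be sketched in a few lines of Python. This is an illustration rather than the original computation: it assumes that the five reported values are used directly as the weights on independent $\chi^2(1)$ draws, and the seed, function names, and percentile rule are arbitrary choices. Under that assumption the simulated 5% critical value should land close to the 16.10 quoted above.

```python
import random


def weighted_chi2_draws(weights, n_draws=10000, seed=12345):
    """Draws of sum_j w_j * z_j^2 with z_j independent standard normal."""
    rng = random.Random(seed)
    return [
        sum(w * rng.gauss(0, 1) ** 2 for w in weights)
        for _ in range(n_draws)
    ]


def upper_tail_pvalue(stat, draws):
    """Proportion of simulated draws that exceed the observed statistic."""
    return sum(d > stat for d in draws) / len(draws)


def critical_value(draws, alpha=0.05):
    """Empirical upper-alpha percentile of the simulated draws."""
    ordered = sorted(draws)
    return ordered[int((1 - alpha) * len(ordered))]


weights = [0.29, 1.00, 1.06, 1.48, 2.75]   # values reported in the example
draws = weighted_chi2_draws(weights)
p_value = upper_tail_pvalue(69.14, draws)  # statistic N * omega_hat^2
crit = critical_value(draws)
```

The simulated critical value varies slightly with the seed and the number of draws, so agreement with the text should be judged to about the first decimal place.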

Given discrimination is possible, then the second test can be applied. Here
$T_{LR} = -0.883$ favors the second model, since it is negative. However, using a standard
normal two-tail test at 5% the difference is not statistically significant. In this example
$\hat{\omega}^2$ is quite large, which means the first test statistic $N\hat{\omega}^2$ is large
but the second test statistic $N^{-1/2}LR(\hat{\theta}, \hat{\gamma})/\hat{\omega}$ is small.


8.6. Consequences of Testing


In practice more than one test is performed before one reaches a preferred model. This 
leads to several complications that practitioners usually ignore. 


8.6.1. Pretest Estimation 


The use of specification tests to choose a model complicates the distribution of an
estimator. For example, suppose we choose between two estimators $\hat{\theta}$ and
$\tilde{\theta}$ on the basis of a statistical test at 5%. For instance, $\hat{\theta}$ and
$\tilde{\theta}$ may be estimators in unrestricted and restricted models. Then the actual
estimator is $\theta^+ = w\hat{\theta} + (1 - w)\tilde{\theta}$, where the random variable $w$
takes value 1 if the test favors $\hat{\theta}$ and 0 if the test favors $\tilde{\theta}$. In short,
the estimator depends on the restricted and unrestricted estimators and on a random
variable $w$, which in turn depends on the significance level of the test. Hence
$\theta^+$ is an estimator with complex properties. This is called a pretest estimator, as
the estimator is based on an initial test. The distribution of $\theta^+$ has been obtained
for the linear regression model under normality and is nonstandard.

In theory statistical inference should be based on the distribution of $\theta^+$. In
practice inference is based on the distribution of $\hat{\theta}$ if $w = 1$ or of
$\tilde{\theta}$ if $w = 0$, ignoring the randomness in $w$. This is done for simplicity, as
even in the simplest models the distribution of the estimator becomes intractable when
several such tests are performed.
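
A small simulation shows why the pretest estimator is nonstandard. The Python sketch below is an illustration under assumptions that are not in the original text: the data are iid normal with known unit variance, the unrestricted estimator is the sample mean, the restricted estimator imposes a mean of zero, and the pretest is a two-sided test of that restriction with critical value 1.96. All names and parameter values are hypothetical.

```python
import math
import random


def pretest_estimate(ybar, n, crit=1.96):
    """Return w * ybar + (1 - w) * 0, where w = 1 if H0: mu = 0 is rejected."""
    t = math.sqrt(n) * ybar          # error variance assumed known and equal to 1
    w = 1 if abs(t) > crit else 0
    return w * ybar + (1 - w) * 0.0


def simulate_pretest(mu, n=25, reps=5000, seed=3):
    """Monte Carlo draws of the pretest estimator for a given true mean."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        ybar = sum(rng.gauss(mu, 1) for _ in range(n)) / n
        draws.append(pretest_estimate(ybar, n))
    return draws


draws = simulate_pretest(mu=0.3)
share_restricted = sum(d == 0.0 for d in draws) / len(draws)
mean_pretest = sum(draws) / len(draws)
```

In this design the restricted value is chosen in a sizable share of replications, so the simulated distribution mixes a point mass at zero with a truncated normal and is centered away from the true mean of 0.3. Neither the distribution of the sample mean nor that of the restricted estimator describes it.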


8.6.2. Order of Testing 


Different conclusions can be drawn according to the order in which tests are con- 
ducted. 

One possible ordering is from general to specific model. For example, one may 
estimate a general model for demand before testing restrictions from consumer de- 
mand theory such as homogeneity and symmetry. Or the cycle may go from specific 
to general model, with regressors added as needed and additional complications such 
as endogeneity controlled for if present. Such orderings are natural when choosing 
which regressors to include in a model, but when specification tests are also being 
performed it is not uncommon to use both general to specific and specific to general 
orderings in the same study. 

A related issue is that of joint versus separate tests. For example, the significance
of two regressors can be tested by either two individual t-tests of significance or a
joint F-test or $\chi^2(2)$ test of significance. A general discussion was given in
Section 7.2.7 and an example is given later in Section 18.7.


8.6.3. Data Mining 


Taken to its extreme, the extensive use of tests to select a model has been called data
mining (Lovell, 1983). For example, one may search among several hundred possible
predictors of $y$ and choose just those predictors that are significant at 5% on a
two-sided test. Computer programs exist that automate such searches and are commonly
used in some branches of applied statistics. Unfortunately, such broad searches will
lead to discovery of spurious relationships, since a test with size 0.05 leads to
erroneous findings of statistical significance 5% of the time. Lovell pointed out that
the application of such a methodology tends to overestimate the goodness-of-fit measures
(e.g., $R^2$) and underestimate the sampling variances of regression coefficients,
even when it succeeds in uncovering the variables that feature in the data-generating
process. Using standard tests and reporting p-values without taking account of the
model-search procedure is misleading because nominal and actual p-values are not
the same. White (2001b) and Sullivan, Timmermann, and White (2001) show how to
use bootstrap methods to calculate the true statistical significance of regressors. See
also P. Hansen (2003).
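
The scale of the problem is easy to see by simulation. In the Python sketch below (an illustration with hypothetical names and settings, not part of the original text) the dependent variable and 200 candidate predictors are all independent noise, and each predictor is tested in a separate simple regression with the approximate critical value 1.96.

```python
import math
import random


def simple_regression_t(y, x):
    """t-statistic on the slope in an OLS regression of y on intercept and x."""
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    r = sxy / math.sqrt(sxx * syy)
    # The slope t-statistic can be written in terms of the correlation r.
    return r * math.sqrt((n - 2) / (1 - r * r))


def count_spurious(n_obs=100, n_predictors=200, crit=1.96, seed=4):
    """Number of pure-noise predictors that appear significant at about 5%."""
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n_obs)]
    count = 0
    for _ in range(n_predictors):
        x = [rng.gauss(0, 1) for _ in range(n_obs)]
        if abs(simple_regression_t(y, x)) > crit:
            count += 1
    return count


n_spurious = count_spurious()
```

With 200 unrelated candidates, roughly ten are expected to pass the 5% screen by chance alone. A model built from the survivors would report significant coefficients even though no relationship exists.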

The motivation for data mining is sometimes to conserve degrees of freedom or 
to avoid overparameterization (“clutter”). More importantly, many aspects of speci- 
fication, such as the functional form of covariates, are left unresolved by underlying 
theory. Given specification uncertainty, justification exists for specification searching 
(Sargan, 2001). However, care needs to be taken especially if small samples are an- 
alyzed and the number of specification searches is large relative to the sample size. 
When the specification search is sequential, with a large number of steps, and with 
each step determined by a previous test outcome, the statistical properties of the pro- 
cedure as a whole are complex and analytically intractable. 


8.6.4. A Practical Approach 


Applied microeconometrics research generally minimizes the problem of pretest esti- 
mation by making judicious use of hypothesis tests. Economic theory is used to guide 
the selection of regressors, to greatly reduce the number of potential regressors. If the 
sample size is large there is little purpose served by dropping “insignificant” variables. 
Final results often use regressions that include statistically insignificant regressors for 
control variables, such as region, industry, and occupation dummies in an earnings 
regression. Clutter can be avoided by not reporting unimportant coefficients in a full 
model specification but noting that fact in an appropriate place. This can lead to some 
loss of precision in estimating the key regressors of interest, such as years of school- 
ing in an earnings regression, but guards against bias caused by erroneously dropping 
variables that should be included. 

Good practice is to use only part of the sample (“training sample”) for specification 
searches and model selection, and then report results using the preferred model esti- 
mated using a completely separate part of the sample (“estimation sample”). In such 
circumstances pretesting does not affect the distribution of the estimator, if the sub- 
samples are independent. This procedure is usually only implemented when sample 
sizes are very large, because using less than the full sample in final estimation leads to 
a loss in estimator precision. 


8.7. Model Diagnostics


In this section we discuss goodness-of-fit measures and definitions of residuals in non- 
linear models. Useful measures are those that reveal model deficiency in some partic- 
ular dimension. 


8.7.1. Pseudo-R2 Measures 


Goodness of fit is interpreted as closeness of fitted values to sample values of the 
dependent variable. 

For linear models with $K$ regressors the most direct measure is the standard error
of the regression, which is the estimated standard deviation of the error term,


$$s = \left[\frac{1}{N - K}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2\right]^{1/2}.$$


For example, a standard error of regression of 0.10 in a log-earnings regression means
that approximately 95% of the fitted values are within 0.20 of the actual value of
log-earnings, or within 22% of actual earnings using $e^{0.2} \approx 1.22$. This measure is
the same as the in-sample root mean squared error where $\hat{y}_i$ is viewed as a forecast
of $y_i$, aside from a degrees of freedom correction. Alternatively, one can use the mean
absolute error $(N - K)^{-1}\sum_i |y_i - \hat{y}_i|$. The same measures can be used for
nonlinear regression models, provided the nonlinear models lead to a predicted value
$\hat{y}_i$ of the dependent variable.
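
Both measures need only the actual values, the fitted values, and the number of regressors. A minimal Python sketch (hypothetical names and toy data, not part of the original text) follows, together with the conversion from a two-standard-error band in logs to a proportional band in levels.

```python
import math


def standard_error_of_regression(y, yhat, k):
    """s = sqrt( sum (y_i - yhat_i)^2 / (N - K) )."""
    n = len(y)
    rss = sum((a - b) ** 2 for a, b in zip(y, yhat))
    return math.sqrt(rss / (n - k))


def mean_absolute_error(y, yhat, k):
    """Mean absolute error with the same degrees-of-freedom correction."""
    n = len(y)
    return sum(abs(a - b) for a, b in zip(y, yhat)) / (n - k)


# Toy data: five observations and one estimated coefficient.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
yhat = [1.5, 1.5, 3.5, 3.5, 5.0]
s = standard_error_of_regression(y, yhat, k=1)
mae = mean_absolute_error(y, yhat, k=1)

# A band of two standard errors with s = 0.10 in a log regression.
proportional_band = math.exp(2 * 0.10)
```

The last line reproduces the log-earnings calculation: a band of 0.20 in logs corresponds to a factor of about 1.22 in levels.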

A related measure in linear models is $R^2$, the coefficient of multiple determination.
This measures the fraction of variation of the dependent variable explained by the
regressors. The statistic $R^2$ is more commonly reported than $s$, even though $s$ may be
more informative in evaluating the goodness of fit.

A pseudo-$R^2$ is an extension of $R^2$ to nonlinear regression models. There are several
interpretations of $R^2$ in the linear model. These lead to several possible pseudo-$R^2$
measures that in nonlinear models differ and do not necessarily have the properties of
lying between zero and one and increasing as regressors are added. We present several
of these measures that, for simplicity, are not adjusted for degrees of freedom.

One approach bases $R^2$ on decomposition of the total sum of squares (TSS), with


$$\sum_i (y_i - \bar{y})^2 = \sum_i (y_i - \hat{y}_i)^2 + \sum_i (\hat{y}_i - \bar{y})^2 + 2\sum_i (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}).$$

The first sum in the right-hand side is the residual sum of squares (RSS) and the second
term is the explained sum of squares (ESS). This leads to two possible measures:

$$R^2_{RES} = 1 - \mathrm{RSS}/\mathrm{TSS},$$
$$R^2_{EXP} = \mathrm{ESS}/\mathrm{TSS}.$$


For OLS regression in the linear model with intercept the third sum equals zero, so
$R^2_{RES} = R^2_{EXP}$. However, this simplification does not occur in other models and in
general $R^2_{RES} \neq R^2_{EXP}$ in nonlinear models. The measure $R^2_{RES}$ can be less than
zero, $R^2_{EXP}$ can exceed one, and both measures may decrease as regressors are added
though $R^2_{RES}$ will increase for NLS regression of the nonlinear model as then the
estimator is minimizing RSS.

A closely related measure uses


$$R^2_{COR} = \widehat{\mathrm{Cor}}^2[y_i, \hat{y}_i],$$


the squared correlation between actual and fitted values. The measure $R^2_{COR}$ lies
between zero and one and equals $R^2$ in OLS regression for the linear model with
intercept. In nonlinear models $R^2_{COR}$ can decrease as regressors are added.
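
The three measures introduced so far can be computed side by side from actual and fitted values. The Python sketch below (hypothetical names and toy data, not part of the original text) uses fitted values that do not come from OLS with an intercept, so the cross-product term in the decomposition is nonzero and the measures disagree.

```python
def pseudo_r2(y, yhat):
    """Return the residual, explained, and correlation-based pseudo-R2."""
    n = len(y)
    ybar = sum(y) / n
    fbar = sum(yhat) / n
    tss = sum((a - ybar) ** 2 for a in y)
    rss = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ess = sum((b - ybar) ** 2 for b in yhat)
    cov = sum((a - ybar) * (b - fbar) for a, b in zip(y, yhat))
    var_f = sum((b - fbar) ** 2 for b in yhat)
    return {
        "RES": 1 - rss / tss,            # 1 - RSS/TSS
        "EXP": ess / tss,                # ESS/TSS
        "COR": cov * cov / (tss * var_f) # squared correlation
    }


# Fitted values that overstate the spread of y.
measures = pseudo_r2([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 4.0, 4.0])
```

For these toy values the explained-sum-of-squares measure exceeds one while the other two stay inside the unit interval, which illustrates why the measures cannot be used interchangeably outside the linear model.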

A third approach uses weighted sums of squares that control for the intrinsic
heteroskedasticity of cross-section data. Let $\hat{\sigma}_i^2$ be the fitted conditional
variance of $y_i$, where it is assumed that heteroskedasticity is explicitly modeled as is
the case for FGLS and for models such as logit and Poisson. Then we can use


$$R^2_{WSS} = 1 - \mathrm{WRSS}/\mathrm{WTSS},$$


where the weighted residual sum of squares
$\mathrm{WRSS} = \sum_i (y_i - \hat{y}_i)^2/\hat{\sigma}_i^2$,
$\mathrm{WTSS} = \sum_i (y_i - \hat{\mu})^2/\hat{\sigma}^2$, and $\hat{\mu}$ and $\hat{\sigma}^2$ are
the estimated mean and variance in the intercept-only model. This can be called a Pearson
$R^2$ because WRSS equals the Pearson statistic, which, aside from any finite-sample
corrections, should equal $N$ if heteroskedasticity is correctly modeled. Note that
$R^2_{WSS}$ can be less than zero and decrease as regressors are added.

A fourth approach is a generalization of $R^2$ to objective functions other than the sum
of squared residuals. Let $Q_N(\theta)$ denote the objective function being maximized,
$Q_0$ denote its value in the intercept-only model, $Q_{fit}$ denote the value in the fitted
model, and $Q_{max}$ denote the largest possible value of $Q_N(\theta)$. Then the maximum
potential gain in the objective function resulting from inclusion of regressors is
$Q_{max} - Q_0$ and the actual gain is $Q_{fit} - Q_0$. This suggests the measure


$$R^2_{RG} = \frac{Q_{fit} - Q_0}{Q_{max} - Q_0} = 1 - \frac{Q_{max} - Q_{fit}}{Q_{max} - Q_0},$$

where the subscript RG means relative gain. For least-squares estimation the objective
function maximized is minus the residual sum of squares. Then $Q_0 = -\mathrm{TSS}$,
$Q_{fit} = -\mathrm{RSS}$, and $Q_{max} = 0$, so $R^2_{RG} = 1 - \mathrm{RSS}/\mathrm{TSS}$ for OLS
or NLS regression, which equals $\mathrm{ESS}/\mathrm{TSS}$ for OLS in the linear model with
intercept. The measure $R^2_{RG}$ has the advantage of lying between zero and one and
increasing as regressors are added. For ML estimation the objective function is
$Q_N(\theta) = \ln L_N(\theta)$. Then $R^2_{RG}$ cannot always be used as in some models there
may be no bound on $Q_{max}$. For example, for the linear model under normality
$L_N(\beta, \sigma^2) \to \infty$ as $\sigma^2 \to 0$. For ML and quasi-ML
estimation of linear exponential family models, such as logit and Poisson, $Q_{max}$ is
usually known and $R^2_{RG}$ can be shown to be an $R^2$ based on the deviance residuals
defined in the next section.

A related measure to $R^2_{RG}$ is $R^2_Q = 1 - Q_{fit}/Q_0$. This measure increases as
regressors are added. It equals $R^2_{RG}$ if $Q_{max} = 0$, which is the case for OLS
regression and for binary and multinomial models. Otherwise, for discrete data this
measure may have upper bound less than one, whereas for continuous data the measure may
not be bounded between zero and one as the log-likelihood can be negative or positive. For
example, for ML estimation with continuous density it is possible that $Q_0 = 1$ and
$Q_{fit} = 4$, leading to $R^2_Q = -3$, or that $Q_0 = -1$ and $Q_{fit} = 4$, leading to
$R^2_Q = 5$.
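The arithmetic behind these two measures can be sketched in a few lines. The function names `r2_rg` and `r2_q` are illustrative, and the least-squares values TSS = 10 and RSS = 4 are made-up numbers used only to show that the two measures agree when the largest attainable objective value is zero.

```python
def r2_rg(q_fit, q0, q_max):
    """Relative gain: actual gain over maximum potential gain."""
    return (q_fit - q0) / (q_max - q0)


def r2_q(q_fit, q0):
    """One minus the ratio of fitted to intercept-only objective values."""
    return 1 - q_fit / q0


# Least squares with hypothetical TSS = 10 and RSS = 4: the objective is
# minus the sum of squares, so Q0 = -10, Qfit = -4, and Qmax = 0.
ls_rg = r2_rg(-4.0, -10.0, 0.0)
ls_q = r2_q(-4.0, -10.0)

# Continuous-density examples from the text: the log-likelihood can take
# either sign, so the second measure can leave the unit interval.
example_1 = r2_q(4.0, 1.0)    # Q0 = 1,  Qfit = 4
example_2 = r2_q(4.0, -1.0)   # Q0 = -1, Qfit = 4

print(ls_rg, ls_q, example_1, example_2)
```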

For nonlinear models there is therefore no universal pseudo-$R^2$. The most useful
measures may be $R^2_{COR}$, as correlation coefficients are easily interpreted, and
$R^2_{RG}$ in special cases that $Q_{max}$ is known. Cameron and Windmeijer (1997) analyze
many of the measures and Cameron and Windmeijer (1996) apply these measures to count data
models.


8.7.2. Residual Analysis 


Microeconometrics analysis actually places little emphasis on residual analysis, com- 
pared to some other areas of statistics. If data sets are small then there is concern that 
residual analysis may lead to overfitting of the model. If the data set is large then 
there is a belief that residual analysis may be unnecessary as a single observation will 
have little impact on the analysis. We therefore give a brief summary. A more exten- 
sive discussion is given in, for example, McCullagh and Nelder (1989) and Cameron 
and Trivedi (1998, chapter 5). Econometricians have had particular interest in defining 
residuals in censored and truncated models. 

A wide range of residuals have been proposed for nonlinear regression models. Consider a
scalar dependent variable $y_i$ with fitted value
$\hat{y}_i = \hat{\mu}_i = \mu(x_i, \hat{\theta})$. The raw residual is
$r_i = y_i - \hat{\mu}_i$. The Pearson residual is the obvious correction for
heteroskedasticity, $p_i = (y_i - \hat{\mu}_i)/\hat{\sigma}_i$, where $\hat{\sigma}_i^2$
is an estimate of the conditional variance of $y_i$. This requires a specification of the
variance for $y_i$, which is done for models such as the Poisson. For an LEF density (see
Section 5.7.3) the deviance residual is
$d_i = \mathrm{sign}(y_i - \hat{\mu}_i)\sqrt{2[l(y_i) - l(\hat{\mu}_i)]}$, where $l(y)$
denotes the log-density of $y|\mu$ evaluated at $\mu = y$ and $l(\hat{\mu})$ denotes
evaluation at $\mu = \hat{\mu}$. A motivation for the deviance residual is that the sum of
squares of these residuals is the deviance statistic that is the generalization for LEF
models of the sum of squared raw residuals in the linear model. The Anscombe residual is
defined to be the transformation of y that is closest to normality, then standardized to
mean zero and variance 1. This transformation has been obtained for LEF densities.
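For the Poisson case these residuals have simple closed forms, sketched below. The counts and fitted means are hypothetical, `poisson_residuals` is an illustrative name, and the deviance term uses the Poisson identity $l(y) - l(\mu) = y \ln(y/\mu) - (y - \mu)$ with $y \ln y = 0$ at $y = 0$.

```python
import math


def poisson_residuals(y, mu):
    """Return (raw, Pearson, deviance) residuals for a Poisson model."""
    raw, pearson, deviance = [], [], []
    for yi, mi in zip(y, mu):
        raw.append(yi - mi)
        # Poisson: conditional variance equals the conditional mean.
        pearson.append((yi - mi) / math.sqrt(mi))
        # 2[l(y) - l(mu)] = 2[y ln(y/mu) - (y - mu)], with y ln y = 0 at y = 0.
        ylog = yi * math.log(yi / mi) if yi > 0 else 0.0
        d2 = 2.0 * (ylog - (yi - mi))
        # max() guards against tiny negative values from rounding.
        deviance.append(math.copysign(math.sqrt(max(d2, 0.0)), yi - mi))
    return raw, pearson, deviance


y = [0, 1, 1, 2, 3, 5]
mu = [0.6, 0.9, 1.4, 2.1, 3.0, 4.0]   # hypothetical fitted means

raw, pearson, deviance = poisson_residuals(y, mu)
pearson_stat = sum(p * p for p in pearson)     # Pearson statistic (WRSS)
deviance_stat = sum(d * d for d in deviance)   # deviance statistic
print(pearson_stat, deviance_stat)
```

The sums of squared Pearson and deviance residuals reproduce the Pearson and deviance statistics, which is the link between these residuals and the weighted and relative-gain $R^2$ measures.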

Small-sample corrections to residuals have been proposed to account for estimation error
in $\hat{\mu}_i$. For the linear model this entails division of residuals by
$\sqrt{1 - h_{ii}}$, where $h_{ii}$ is the ith diagonal entry in the hat matrix
$H = X(X'X)^{-1}X'$. These residuals are felt to have better finite-sample performance.
Since H has rank K, the number of regressors, the average value of $h_{ii}$ is K/N and
values of $h_{ii}$ in excess of 2K/N are viewed as having high leverage. These results
extend to LEF models with $H = W^{1/2}X(X'WX)^{-1}X'W^{1/2}$, where
$W = \mathrm{Diag}[w_{ii}]$ and $w_{ii} = g'(x_i'\beta)/\sigma_i^2$ with $g(x_i'\beta)$
and $\sigma_i^2$ the specified conditional mean and variance, respectively. McCullagh
and Nelder (1989) provide a summary.
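For the linear model with an intercept and a single regressor, the diagonal of the hat matrix has the well-known closed form $h_{ii} = 1/N + (x_i - \bar{x})^2/\sum_j (x_j - \bar{x})^2$, which avoids any matrix algebra. The sketch below uses that special case with made-up regressor values; the function names are illustrative.

```python
import math


def leverage(x):
    """Diagonal entries h_ii of the hat matrix for y on an intercept and x."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]


def scale_residuals(resid, h):
    """Divide each residual by sqrt(1 - h_ii)."""
    return [r / math.sqrt(1.0 - hi) for r, hi in zip(resid, h)]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]   # hypothetical data; the last point is far out
K = 2                                  # intercept plus one regressor
h = leverage(x)

# The h_ii sum to K, so their average is K/N; flag values above 2K/N.
high_leverage = [hi > 2 * K / len(x) for hi in h]
print(sum(h), high_leverage)
```

Only the outlying last observation is flagged, and its residual would be scaled up the most by the correction.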

More generally, Cox and Snell (1968) define a generalized residual to be any scalar
function $r_i = r(y_i, x_i, \hat{\theta})$ that satisfies some relatively weak conditions.
One way that such residuals arise is that many estimators have first-order conditions of
the form $\sum_i g(x_i, \hat{\theta})\, r(y_i, x_i, \hat{\theta}) = 0$, where $y_i$
appears in the scalar r(·) but not in the vector g(·). See also White (1994).

For regression models based on a normal latent variable (see Chapters 14 and 16) Chesher
and Irish (1987) propose using $E[\varepsilon_i^*|y_i]$ as the residual, where
$y_i^* = \mu_i + \varepsilon_i^*$ is the unobserved latent variable and $y_i = g(y_i^*)$
is the observed dependent variable. Particular choices of g(·) correspond to the probit
and Tobit models. Gouriéroux et al. (1987) generalize this approach to LEF densities. A
natural approach in this context is to treat residuals as missing data, along the lines
of the expectation maximization algorithm in Section 10.3.

A common use of residuals is in plots against other variables of interest. Plots of 
residuals against fitted values can reveal poor model fit; plots of residuals against omit- 
ted variables can suggest further regressors to include in the model; and plots of resid- 
uals against included regressors can suggest need for a different functional form. It can 
be helpful to include a nonparametric regression line in such plots (see Chapter 9). If
data take only a few discrete values the plots can be difficult to interpret because of 
clustering at just a few values, and it can be helpful to use a so-called jitter feature that 
adds some random noise to the data to reduce the clustering. 

Some parametric models imply that an appropriately defined residual should be normally
distributed. This can be checked by a normal scores plot that orders residuals $r_i$ from
smallest to largest and plots them against the values predicted if the residuals were
exactly normally distributed. Thus plot ordered $r_i$ against
$\bar{r} + s_r \Phi^{-1}((i - 0.5)/N)$, where $\bar{r}$ and $s_r$ are the sample mean and
standard deviation of r and $\Phi^{-1}(\cdot)$ is the inverse of the standard normal cdf.
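The coordinates for such a plot can be computed with the Python standard library alone. This is a minimal sketch: the residual values are made up, `normal_scores` is an illustrative name, and the sample standard deviation uses the N − 1 divisor, which is an assumption since the text does not specify the divisor.

```python
from statistics import NormalDist, mean, stdev


def normal_scores(resid):
    """Return (predicted, ordered residual) pairs for a normal scores plot."""
    n = len(resid)
    rbar, s_r = mean(resid), stdev(resid)   # stdev uses the N - 1 divisor
    ordered = sorted(resid)
    std_normal = NormalDist()               # mean 0, standard deviation 1
    predicted = [rbar + s_r * std_normal.inv_cdf((i - 0.5) / n)
                 for i in range(1, n + 1)]
    return list(zip(predicted, ordered))


resid = [0.3, -1.2, 0.8, -0.1, 0.2]   # hypothetical residuals
pairs = normal_scores(resid)
for predicted_value, ordered_value in pairs:
    print(predicted_value, ordered_value)
```

If the residuals are close to normal the pairs lie near the 45-degree line; systematic curvature suggests skewness or heavy tails.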


8.7.3. Diagnostics Example 


Table 8.3 uses the same data-generating process as in Section 8.5.5. The dependent
variable y has sample mean 1.92 and standard deviation 1.84. Poisson regression of y
on $x_3$ and of y on $x_3$ and $x_3^2$ yields


Model 1: $E[y|x] = \exp(0.586 + 0.389 x_3)$,
                        (5.20)  (7.60)


Model 2: $E[y|x] = \exp(0.493 + 0.359 x_3 + 0.091 x_3^2)$,
                        (5.14)  (5.10)      (1.78)


where t-statistics are given in parentheses.

In this example all $R^2$ measures increase with addition of $x_3^2$ as regressor, though
by quite different amounts given that in this example all but the last $R^2$ have similar
values. More generally the first three $R^2$ are scaled similarly and $R^2_{RES}$ and
$R^2_{COR}$ can be quite close, but the remaining three measures are scaled quite
differently. Only the last two $R^2$ measures are guaranteed to increase as a regressor is
added, unless the objective function is the sum of squared errors. The measure $R^2_{RG}$
can be constructed here, as the Poisson log-likelihood is maximized if the fitted mean
$\hat{\mu}_i = y_i$ for all i, leading to
$Q_{max} = \sum_i [y_i \ln y_i - y_i - \ln y_i!]$, where $y \ln y = 0$ when $y = 0$.
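The construction of the relative-gain measure for Poisson regression can be sketched as follows. The counts and fitted means are hypothetical (they are not the data behind Table 8.3), the intercept-only fitted mean is taken to be the sample mean of y, and the function names are illustrative.

```python
import math


def poisson_loglik(y, mu):
    """Poisson log-likelihood: sum over i of y_i ln(mu_i) - mu_i - ln(y_i!)."""
    return sum(yi * math.log(mi) - mi - math.lgamma(yi + 1)
               for yi, mi in zip(y, mu))


def poisson_q_max(y):
    """Largest attainable log-likelihood, reached at mu_i = y_i (0 ln 0 = 0)."""
    return sum((yi * math.log(yi) if yi > 0 else 0.0) - yi - math.lgamma(yi + 1)
               for yi in y)


def r2_relative_gain(y, mu_fit):
    """(Qfit - Q0) / (Qmax - Q0) for a Poisson model."""
    ybar = sum(y) / len(y)
    q0 = poisson_loglik(y, [ybar] * len(y))   # intercept-only model
    q_fit = poisson_loglik(y, mu_fit)
    q_max = poisson_q_max(y)
    return (q_fit - q0) / (q_max - q0)


y = [0, 1, 1, 2, 3, 5]
mu_fit = [0.6, 0.9, 1.4, 2.1, 3.0, 4.0]   # hypothetical fitted means
r2 = r2_relative_gain(y, mu_fit)
print(r2)
```

Because twice the gap between the largest attainable and the fitted log-likelihood is the deviance, the same number is obtained as one minus the ratio of the fitted-model deviance to the intercept-only deviance.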

Additionally, three residuals were calculated for the second model. The sample
mean and standard deviation of residuals were, respectively, 0 and 1.65 for the raw
residuals, 0.01 and 1.97 for the Pearson residuals, and −0.21 and 1.22 for the deviance
residuals. The zero mean for the raw residual is a property of Poisson regression with
intercept included that is shared by very few other models. The larger standard deviation
of the raw residuals reflects the lack of scaling and the fact that here the standard
deviation of y exceeds 1. The correlations between pairs of these residuals all exceed
0.96. This is likely to happen when $R^2$ is low so that $\hat{y}_i \simeq \bar{y}$.


Table 8.3. Pseudo $R^2$s: Poisson Regression Example (a)

Diagnostic                                          Model 1   Model 2   Difference
$s$, where $s^2 = RSS/(N - K)$                      0.1662    0.1661     0.0001
$R^2_{RES} = 1 - RSS/TSS$                           0.1885    0.1962    +0.0077
$R^2_{EXP} = ESS/TSS$                               0.1667    0.2087    +0.0402
$R^2_{COR} = \widehat{Cor}^2[y_i, \hat{y}_i]$       0.1893    0.1964    +0.0067
$R^2_{WSS} = 1 - WRSS/WTSS$                         0.1562    0.1695    +0.0233
$R^2_{RG} = (Q_{fit} - Q_0)/(Q_{max} - Q_0)$        0.1552    0.1712    +0.0160
$R^2_Q = 1 - Q_{fit}/Q_0$                           0.0733    0.0808    +0.0075

(a) N = 100. Model 1 is Poisson regression of y on intercept and $x_3$. Model 2 is Poisson
regression of y on intercept, $x_3$, and $x_3^2$. RSS is residual sum of squares (SS), ESS
is explained SS, TSS is total sum of squares, WRSS is weighted RSS, WTSS is weighted TSS,
$Q_{fit}$ is fitted value of objective function, $Q_0$ is fitted value in intercept-only
model, and $Q_{max}$ is the maximum possible value of the objective function given the
data and exists only for some objective functions.


8.8. Practical Considerations 


m-Tests and Hausman tests are most easily implemented by use of auxiliary regres- 
sions. One should be aware that these auxiliary regressions may be valid only under 
distributional assumptions that are stronger than those made to obtain the usual robust 
standard errors of regression coefficients. Some robust tests have been presented in 
Section 8.4. 

With a large enough data set and fixed significance level such as 5% the sample mo- 
ment conditions implied by a model will be rejected, except in the unrealistic case that 
all aspects of the model—functional form, regressors, and distribution — are correctly 
specified. In classical testing situations this is often a desired result. In particular, with 
a large enough sample, regression coefficients will always be significantly different 
from zero and many studies seek such a result. However, for specification tests the 
desire is usually to not reject, so that one can say that the model is correctly specified. 
Perhaps for this reason specification tests are under-utilized. 

As an illustration, consider tests of correct specification of life-cycle models of
consumption. Unless samples are small a dedicated specification tester is likely to
reject the model at 5%. For example, suppose a model specification test statistic that
is $\chi^2(12)$ distributed, when applied to a sample with N = 3,000, has a p-value of
0.02. It is not clear that the life-cycle model is providing a poor explanation of the
data, even though it would be formally rejected at the 5% significance level. One
possibility is to increase the critical value as sample size increases using BIC (see
Section 8.5.1).
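One way to read the BIC suggestion, offered here as an assumption rather than as the procedure of Section 8.5.1, is to treat the test as a comparison of two nested models that differ by q restrictions. BIC then favors the less restricted model only if a likelihood-ratio type statistic exceeds q ln N, a threshold that grows with the sample size. The sketch below compares that threshold with the conventional 5% critical value for the example in the text.

```python
import math

q = 12       # degrees of freedom of the chi-square test
n = 3000     # sample size

# Conventional 5% critical value of chi-square(12), from standard tables.
chi2_crit_5pct = 21.03

# BIC-type threshold under the nested-model reading described above.
bic_threshold = q * math.log(n)

print(round(bic_threshold, 2), chi2_crit_5pct)
```

Under this reading a statistic with a p-value of 0.02 falls far short of the BIC-type threshold, so the model would not be rejected on this criterion.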


Another reason for underutilization of specification tests is difficulty in computation
and poor size properties of tests when more convenient auxiliary regressions are used
to implement an asymptotically equivalent version of a test. These drawbacks can be
greatly reduced by use of the bootstrap. Chapter 11 presents bootstrap methods to
implement many of the tests given in this chapter.


8.9. Bibliographic Notes


8.2 The conditional moment test, due to Newey (1985) and Tauchen (1985), is a generalization
of the information matrix test of White (1982). For ML estimation, the computation of the
m-test by auxiliary regression generalizes methods of Lancaster (1984) and Chesher (1984)
for the IM test. A good overview of m-tests is given in Pagan and Vella (1989). The m-test
provides a very general framework for viewing testing. It can be shown to nest all tests,
such as Wald, LM, LR, and Hausman tests. This unifying element is emphasized in White
(1994).

8.3 The Hausman test was proposed by Hausman (1978), with earlier references already given
in Section 8.3 and a good survey provided by Ruud (1984).

8.4 The econometrics texts by Greene (2003), Davidson and MacKinnon (1993) and Wooldridge
(2002) present many of the standard specification tests.

8.5 Pesaran and Pesaran (1993) discuss how the Cox (1961, 1962b) nonnested test can be
implemented when an analytical expression for the expectation of the log-likelihood is not
available. Alternatively, the test of Vuong (1989) can be used.

8.7 Model diagnostics for nonlinear models are often obtained by extension of results for the
linear regression model to generalized linear models such as logit and Poisson models. A
detailed discussion with references to the literature is given in Cameron and Trivedi (1998,
Chapter 5).


Exercises 


8-1 Suppose $y = x'\beta + u$, where $u \sim N[0, \sigma^2]$, with parameter vector
$\theta = [\beta', \sigma^2]'$ and density
$f(y|\theta) = (1/(\sqrt{2\pi}\,\sigma)) \exp[-(y - x'\beta)^2/2\sigma^2]$. We have a
sample of N independent observations.

(a) Explain why a test of the moment condition $E[x(y - x'\beta)^3] = 0$ is a test of the
assumption of normally distributed errors.

(b) Give the expressions for $\hat{m}_i$ and $\hat{s}_i$ given in (8.5) necessary to
implement the m-test based on the moment condition in part (a).

(c) Suppose dim[x] = 10, N = 100, and the auxiliary regression in (8.5) yields an
uncentered $R^2$ of 0.2. What do you conclude at level 0.05?

(d) For this example give the moment conditions tested by White's information
matrix test.


8-2 Consider the multinomial version of the PCGF test given in (8.23) with $p_j$ replaced
by $\hat{p}_j = N^{-1} \sum_i F_j(x_i, \hat{\theta})$. Show that PCGF can be expressed as
CGF in (8.27) with $V = \mathrm{Diag}[N p_j]$. [Conclude that in the multinomial case
Andrews' test statistic simplifies to Pearson's statistic.]


8-3 (Adapted from Amemiya, 1985). For the Hausman test given in Section 8.4.1 let
$V_{11} = V[\hat{\theta}]$, $V_{22} = V[\tilde{\theta}]$, and
$V_{12} = \mathrm{Cov}[\hat{\theta}, \tilde{\theta}]$.

(a) Show that the estimator
$\bar{\theta} = \hat{\theta} + [V_{11} - V_{12}][V_{11} + V_{22} - 2V_{12}]^{-1}(\tilde{\theta} - \hat{\theta})$
has asymptotic variance matrix
$V[\bar{\theta}] = V_{11} - [V_{11} - V_{12}][V_{11} + V_{22} - 2V_{12}]^{-1}[V_{11} - V_{12}]$.

(b) Hence show that $V[\bar{\theta}]$ is less than $V[\hat{\theta}]$ in the matrix sense
unless $\mathrm{Cov}[\hat{\theta}, \tilde{\theta}] = V[\hat{\theta}]$.

(c) Now suppose that $\hat{\theta}$ is fully efficient. Can $V[\bar{\theta}]$ be less than
$V[\hat{\theta}]$? What do you conclude?


8-4 Suppose that two models are non-nested and there are N = 200 observations.
For model 1, the number of parameters q = 10 and ln L = −400. For model 2,
q = 10 and ln L = −380.

(a) Which model is favored using AIC?

(b) Which model is favored using BIC?

(c) Which model would be favored if the models were actually nested and we
used a likelihood ratio test at level 0.05?


8-5 Use the health expenditure data of Section 16.6. The model is a probit regression
of DMED, an indicator variable for positive health expenditures, against the
17 regressors listed in the second paragraph of Section 16.6. You should obtain
the estimates given in the first column of Table 16.1.

(a) Test the joint statistical significance of the self-rated health indicators HLTHG,
HLTHF, and HLTHP at level 0.05 using a Hausman test. [This may require
some additional coding, depending on the package used.]

(b) Is the Hausman test the best test to use here?

(c) Does an information matrix test at level 0.05 support the restrictions of this
model? [This will require some additional coding.]

(d) Discriminate between a model that drops HLTHG, HLTHF, and HLTHP and a
model that drops LC, IDP, and LPI on the basis of $R^2_{RES}$, $R^2_{EXP}$,
$R^2_{COR}$, and $R^2_{RG}$.


CHAPTER 9


Semiparametric Methods 


9.1. Introduction 


In this chapter we present methods for data analysis that require less model specifica- 
tion than the methods of the preceding chapters. 

We begin with nonparametric estimation. This makes very minimal assumptions 
regarding the process that generated the data. One leading example is estimation of 
a continuous density using a kernel density estimate. This has the attraction of pro- 
viding a smoother estimate than the familiar histogram. A second leading example 
is nonparametric regression, such as kernel regression, on a scalar regressor. This 
places a flexible curve on an (x, y) scatterplot with no parametric restrictions on the 
form of the curve. Nonparametric estimates have numerous uses, including data de- 
scription, exploratory analysis of data and of fitted residuals from a regression model, 
and summary across simulations of parameter estimates obtained from a Monte Carlo 
study. 

Econometric analysis emphasizes multivariate regression of a scalar y on a vector 
of regressors x. However, nonparametric methods, although theoretically possible with 
an infinitely large sample, break down in practice because the data need to be sliced in 
several dimensions, leading to too few data points in each slice. 

As a result econometricians have focused on semiparametric methods. These com- 
bine a parametric component, greatly reducing the dimensionality, with a nonpara- 
metric component. One important application is to permit more flexible models of the 
conditional mean. For example, the conditional mean E[y|x] may be parameterized to
be of the single-index form $g(x'\beta)$, where the functional form for g(·) is not
specified but is instead nonparametrically estimated, along with the unknown parameters
$\beta$. Another important application relaxes distributional assumptions that if
misspecified lead to inconsistent parameter estimates. For example, we may wish to obtain
consistent estimates of $\beta$ in a linear regression model $y = x'\beta + \varepsilon$
when data on y are truncated or censored (see Chapter 16), without having to correctly
specify the particular distribution of the error term $\varepsilon$.

The asymptotic theory for nonparametric methods differs from that for more parametric
methods. Estimates are obtained by cutting the data into ever smaller slices as
$N \to \infty$ and estimating local behavior within each slice. Since fewer than N
observations are being used in estimating each slice the convergence rate is slower than
that obtained in the preceding chapters. Nonetheless, in the simplest cases nonparametric
estimates are still asymptotically normally distributed. In some leading cases of
semiparametric regression the estimators of parameters $\beta$ have the usual property of
converging at rate $N^{-1/2}$, so that scaling by $\sqrt{N}$ leads to a limit normal
distribution, whereas the nonparametric component of the model converges at a slower rate
$N^{-r}$, $r < 1/2$.

Because nonparametric methods are local averaging methods, different choices of 
localness lead to different finite-sample results. In some restrictive cases there are rules 
or methods to determine the bandwidth or window width used in local averaging, just 
as there are rules for determining the number of bins in a histogram given the number 
of observations. In addition, it is common practice to use the nonscientific method of 
choosing the bandwidth that gives a graph that to the eye looks reasonably smooth yet 
is still capable of picking up details in the relationship of interest. 

Nonparametric methods form the bulk of this chapter, both because they are of 
intrinsic interest and because they are an essential input for semiparametric methods, 
presented most notably in the chapters on discrete and censored dependent-variable 
models. Kernel methods are emphasized as they are relatively simple to present and 
because “It is argued that all smoothing methods are in an asymptotic sense essentially 
equivalent to kernel smoothing” (Härdle, 1990, p. xi). 

Section 9.2 provides examples of nonparametric density estimation and nonpara- 
metric regression applied to data. Kernel density estimation is presented in Section 
9.3. Local regression is discussed in Section 9.4, to provide motivation for the formal 
treatment of kernel regression given in Section 9.5. Section 9.6 presents nonparamet- 
ric regression methods other than kernel methods. The vast topic of semiparametric 
regression is then introduced in Section 9.7. 


9.2. Nonparametric Example: Hourly Wage 


As an example we consider the hourly wage and education for 175 women aged 
36 years who worked in 1993. The data are from the Michigan Panel Survey of In- 
come Dynamics. It is easily established that the distribution of the hourly wage is 
right-skewed and so we model In wage, the natural logarithm of the hourly wage. 

We give just one example of nonparametric density estimation and one of nonpara- 
metric regression and illustrate the important role of bandwidth selection. Sections 9.3 
to 9.6 then provide the underlying theory. 


9.2.1. Nonparametric Density Estimate 


A histogram of the natural logarithm of wage is given in Figure 9.1. To provide detail
the bin width is chosen so that there are 30 bins, each of width about 0.20. This is an
unusually narrow bin width for only 175 observations, but many details are lost with
a larger bin width. The log-wage data seem to be reasonably symmetric, though they
are possibly slightly left-skewed.

[Figure 9.1 plot not reproduced. Panel title: Histogram for Log Wage.]
Figure 9.1: Histogram for natural logarithm of hourly wage. Data for 175 U.S. women aged
36 years who worked in 1993.

The standard smoothed nonparametric density estimate is the kernel density esti- 
mate defined in (9.3). Here we use the Epanechnikov kernel defined in Table 9.1. 

The essential decision in implementation is the choice of bandwidth. For this example
Silverman's plug-in estimate defined in (9.13) yields bandwidth of h = 0.545.
Then the kernel estimate is a weighted average of those observations that have log
wage within 0.21 units of the log wage at the current point of evaluation, with greatest
weight placed on data closest to the current point of evaluation. Figure 9.2 presents
three kernel density estimates, with bandwidths of 0.273, 0.545 and 1.091, respectively
corresponding to one-half the plug-in, the plug-in, and two times the plug-in bandwidth.
Clearly the smallest bandwidth is too small as it leads to too jagged a density estimate.
The largest bandwidth oversmooths the data. The middle bandwidth, the plug-in value of
0.545, seems the best choice. It gives a reasonably smooth density estimate.

[Figure 9.2 plot not reproduced. Panel title: Density Estimates as Bandwidth Varies.
Vertical axis: kernel density estimates. Horizontal axis: log hourly wage. Curves:
one-half plug-in, plug-in, two times plug-in.]
Figure 9.2: Kernel density estimates for log wage for three different bandwidths using the
Epanechnikov kernel. The plug-in bandwidth is h = 0.545. Same data as Figure 9.1.

What might we do with this kernel density estimate? One possibility is to compare 
the density to the normal, by superimposing a normal density with mean equal to the 
sample mean and variance equal to the sample variance. The graph is not reproduced 
here but reveals that the kernel density estimate with preferred bandwidth 0.545 is con- 
siderably more peaked than the normal. A second possibility is to compare log-wage 
kernel density estimates for different subgroups, such as by educational attainment or 
by full-time or part-time work status. 


9.2.2. Nonparametric Regression 


We consider the relationship between log wage and education. The nonparametric 
method used here is the Lowess local regression method, a local weighted average 
estimator (see Equation (9.16) and Section 9.6.2). 

A local weighted regression line at each point x is fitted using centered subsets that 
include the closest 0.8N observations, the program default, where N is the sample 
size, and the weights decline as we move away from x. For values of x near the end 
points, smaller uncentered subsets are used. 

Figure 9.3 gives a scatter plot of log wage against education and three Lowess 
regression curves for bandwidths of 0.8, 0.4 and 0.1. The first two bandwidths give 
similar curves. The relationship appears to be quadratic, but this may be speculative as 
the data are relatively sparse at low education levels, with less than 10% of the sample 
having less than 10 years of schooling. For the majority of the data a linear relationship 
may also work well. For simplicity we have not presented 95% confidence intervals or 
bands that might also be provided. 


[Figure 9.3 plot not reproduced. Panel title: Nonparametric Regression as Bandwidth Varies.]
Figure 9.3: Nonparametric regression of log wage on education for three different
bandwidths using Lowess regression. Same sample as Figure 9.1.


9.3. Kernel Density Estimation


Nonparametric density estimates are useful for comparison across different groups and 
for comparison to a benchmark density such as the normal. Compared to a histogram 
they have the advantage of providing a smoother density estimate. A key decision, 
analogous to choosing the number of bins in a histogram, is bandwidth choice. We 
focus on the standard nonparametric density estimator, the kernel density estimator. A 
detailed presentation is given as results also relevant for regression are more simply 
obtained for density estimation. 


9.3.1. Histogram 


A histogram is an estimate of the density formed by splitting the range of x into 
equally spaced intervals and calculating the fraction of the sample in each interval. 

We give a more formal presentation of the histogram, one that extends naturally to
the smoother kernel density estimator. Consider estimation of the density f(x) of a
scalar continuous random variable x evaluated at $x_0$. Since the density is the derivative
of the cdf F(x) (i.e., $f(x_0) = dF(x_0)/dx$) we have

$$f(x_0) = \lim_{h \to 0} \frac{F(x_0 + h) - F(x_0 - h)}{2h}
         = \lim_{h \to 0} \frac{\Pr[x_0 - h < x < x_0 + h]}{2h}.$$

For a sample $\{x_i, i = 1, \ldots, N\}$ of size N, this suggests using the estimator

$$\hat{f}_{HIST}(x_0) = \frac{1}{N} \sum_{i=1}^{N} \frac{\mathbf{1}(x_0 - h < x_i < x_0 + h)}{2h}, \qquad (9.1)$$

where the indicator function

$$\mathbf{1}(A) = \begin{cases} 1 & \text{if event } A \text{ occurs}, \\ 0 & \text{otherwise}. \end{cases}$$

The estimator $\hat{f}_{HIST}(x_0)$ is a histogram estimate centered at $x_0$ with bin
width 2h, since it equals the fraction of the sample that lies between $x_0 - h$ and
$x_0 + h$ divided by the bin width 2h. If $\hat{f}_{HIST}$ is evaluated over the range of
x at equally spaced values of x, each 2h units apart, it yields a histogram.

The estimator $\hat{f}_{HIST}(x_0)$ gives all observations in $x_0 \pm h$ equal weight as
is clear from rewriting (9.1) as

$$\hat{f}_{HIST}(x_0) = \frac{1}{Nh} \sum_{i=1}^{N} \frac{1}{2} \times \mathbf{1}\left( \left| \frac{x_i - x_0}{h} \right| < 1 \right). \qquad (9.2)$$

This leads to a density estimate that is a step function, even if the underlying density
is continuous. Smoother estimates can be obtained by using weighting functions other
than the indicator function chosen here.


9.3.2. Kernel Density Estimator 


The kernel density estimator, introduced by Rosenblatt (1956), generalizes the histogram
estimate (9.2) by using an alternative weighting function, so

$$\hat{f}(x_0) = \frac{1}{Nh} \sum_{i=1}^{N} K\left( \frac{x_i - x_0}{h} \right). \qquad (9.3)$$

The weighting function K(·) is called a kernel function and satisfies restrictions given
in the next section. The parameter h is a smoothing parameter called the bandwidth,
and two times h is the window width. The density is estimated by evaluating
$\hat{f}(x_0)$ at a wider range of values of $x_0$ than used in forming a histogram;
usually evaluation is at the sample values $x_1, \ldots, x_N$. This also helps provide a
density estimate smoother than a histogram.
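The link between the histogram estimate and the kernel estimate can be checked directly in code. The sketch below implements the centered histogram estimator and the kernel estimator on a small made-up sample; the function names are illustrative, and the uniform and Epanechnikov weights are the standard forms on |z| < 1.

```python
def f_hist(x0, data, h):
    """Histogram estimate at x0: share of data in (x0 - h, x0 + h) over 2h."""
    n = len(data)
    count = sum(1 for xi in data if x0 - h < xi < x0 + h)
    return count / (2.0 * h * n)


def uniform_kernel(z):
    return 0.5 if abs(z) < 1 else 0.0


def epanechnikov_kernel(z):
    return 0.75 * (1 - z * z) if abs(z) < 1 else 0.0


def f_kernel(x0, data, h, kernel):
    """Kernel density estimate at x0 with bandwidth h."""
    n = len(data)
    return sum(kernel((xi - x0) / h) for xi in data) / (n * h)


data = [1.0, 1.5, 2.0, 2.2, 3.1, 4.0]   # hypothetical sample
x0, h = 2.0, 0.6

hist_value = f_hist(x0, data, h)
uniform_value = f_kernel(x0, data, h, uniform_kernel)
epan_value = f_kernel(x0, data, h, epanechnikov_kernel)
print(hist_value, uniform_value, epan_value)
```

With the uniform kernel the two estimators coincide, while the Epanechnikov kernel gives more weight to observations close to the evaluation point.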


9.3.3. Kernel Functions 


The kernel function K(-) is a continuous function, symmetric around zero, that inte- 
grates to unity and satisfies additional boundedness conditions. Following Lee (1996) 
we assume that the kernel satisfies the following conditions: 


(i) K(z) is symmetric around 0 and is continuous.
(ii) $\int K(z)\,dz = 1$, $\int zK(z)\,dz = 0$, and $\int |K(z)|\,dz < \infty$.
(iii) Either (a) K(z) = 0 if $|z| > z_0$ for some $z_0$ or (b) $|z|K(z) \to 0$ as
$|z| \to \infty$.
(iv) $\int z^2 K(z)\,dz = \kappa$, where $\kappa$ is a constant.


In practice kernel functions work better if they satisfy condition (iiia) rather than
just the weaker condition (iiib). Then restricting attention to the interval [−1, 1] rather
than $[-z_0, z_0]$ is simply a normalization for convenience, and usually K(z) is
restricted to $z \in [-1, 1]$.

Some commonly used kernel functions are given in Table 9.1. The uniform kernel
uses the same weights as a histogram of bin width 2h, except that it produces a running
histogram that is evaluated at a series of points $x_0$ rather than using fixed bins. The
Gaussian kernel satisfies (iiib) rather than (iiia) because it does not restrict
$z \in [-1, 1]$. A pth-order kernel is one whose first nonzero moment is the pth moment.
The first seven kernels are of second order and satisfy the second condition in (ii). The
last two kernels are fourth-order kernels. Such higher order kernels can increase rates of
convergence if f(x) is more than twice differentiable (see Section 9.3.10), though they
can take negative values. Table 9.1 also gives the parameter $\delta$, defined in (9.11)
and used in Section 9.3.6 to aid bandwidth choice, for some of the kernels.

Given K(·) and h the estimator is very simple to implement. If the kernel estimator
is evaluated at r distinct values of $x_0$ then computation of the kernel estimator
requires at most Nr operations, when the kernel has unbounded support. Considerable
computational savings on this are possible; see, for example, Härdle (1990, p. 35).




Table 9.1. Kernel Functions: Commonly Used Examples (a)

Kernel                             Kernel Function K(z)                                    $\delta$
Uniform (or box or rectangular)    $\frac{1}{2} \times 1(|z| < 1)$                         1.3510
Triangular (or triangle)           $(1 - |z|) \times 1(|z| < 1)$                           –
Epanechnikov (or quadratic)        $\frac{3}{4}(1 - z^2) \times 1(|z| < 1)$                1.7188
Quartic (or biweight)              $\frac{15}{16}(1 - z^2)^2 \times 1(|z| < 1)$            2.0362
Triweight                          $\frac{35}{32}(1 - z^2)^3 \times 1(|z| < 1)$            2.3122
Tricubic                           $\frac{70}{81}(1 - |z|^3)^3 \times 1(|z| < 1)$          –
Gaussian (or normal)               $(2\pi)^{-1/2} \exp(-z^2/2)$                            0.7764
Fourth-order Gaussian              $\frac{1}{2}(3 - z^2)(2\pi)^{-1/2} \exp(-z^2/2)$        –
Fourth-order quartic               $\frac{15}{32}(3 - 10z^2 + 7z^4) \times 1(|z| < 1)$     –

(a) The constant $\delta$ is defined in (9.11) and is used to obtain Silverman's plug-in
estimate given in (9.13).
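The kernel conditions and the tabulated constants can be verified by simple numerical integration. The sketch below is illustrative only: it uses a midpoint rule on [−1, 1], covers just the bounded-support second-order kernels, and assumes that the constant in (9.11) takes the form $\delta = (\int K(z)^2 dz / (\int z^2 K(z) dz)^2)^{1/5}$, since equation (9.11) is not shown in this section.

```python
KERNELS = {
    "uniform": lambda z: 0.5,
    "epanechnikov": lambda z: 0.75 * (1 - z * z),
    "quartic": lambda z: 15 / 16 * (1 - z * z) ** 2,
    "triweight": lambda z: 35 / 32 * (1 - z * z) ** 3,
}


def integrate(g, a=-1.0, b=1.0, n=20000):
    """Midpoint-rule approximation to the integral of g over [a, b]."""
    w = (b - a) / n
    return w * sum(g(a + (j + 0.5) * w) for j in range(n))


def kernel_summary(k):
    """Return (mass, first moment, second moment, delta) for kernel k."""
    mass = integrate(k)
    first = integrate(lambda z: z * k(z))
    second = integrate(lambda z: z * z * k(z))
    roughness = integrate(lambda z: k(z) ** 2)
    # Assumed form of the constant delta in (9.11).
    delta = (roughness / second ** 2) ** 0.2
    return mass, first, second, delta


summary = {name: kernel_summary(k) for name, k in KERNELS.items()}
for name, (mass, first, second, delta) in summary.items():
    print(name, round(mass, 4), round(abs(first), 4), round(second, 4), round(delta, 4))
```

Each kernel integrates to one with zero first moment, and under the assumed formula the computed constants agree with the values of δ reported in Table 9.1 to about four decimal places.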


9.3.4. Kernel Density Example 


The key choice of bandwidth h has already been illustrated in Figure 9.2. 

Here we illustrate the choice of kernel using generated data, a random sample of
size 100 drawn from the $N[0, 25^2]$ distribution. For the particular sample drawn the
sample mean is 2.81 and the sample standard deviation is 25.27.

Figure 9.4 shows the effect of using different kernels. For Epanechnikov, Gaussian,
quartic and uniform kernels, Silverman's plug-in estimate given in (9.13) yields
bandwidths of, respectively, 0.545, 0.246, 0.646, and 0.214. The resulting kernel density
estimates are very similar, even for the uniform kernel which produces a running 
histogram. The variation in density estimate with kernel choice is much less than the 
variation with bandwidth choice evident in Figure 9.2. 


[Figure 9.4 plot not reproduced. Panel title: Density Estimates as Kernel Varies.
Vertical axis: kernel density estimates. Horizontal axis: log hourly wage. Curves:
Epanechnikov (h=0.545), Gaussian (h=0.246), Quartic (h=0.646), Uniform (h=0.214).]
Figure 9.4: Kernel density estimates for log wage for four different kernels using the
corresponding Silverman's plug-in estimate for bandwidth. Same data as Figure 9.1.


9.3.5. Statistical Inference


We present the distribution of the kernel density estimator $\hat{f}(x_0)$ for given
choice of K(·) and h, assuming the data x are iid. The estimate $\hat{f}(x_0)$ is biased.
This bias goes to zero asymptotically if the bandwidth $h \to 0$ as $N \to \infty$, so
$\hat{f}(x_0)$ is consistent. However, the bias term does not necessarily disappear in the
asymptotic normal distribution for $\hat{f}(x_0)$, complicating statistical inference.


Mean and Variance 


The mean and variance of $\hat{f}(x_0)$ are obtained in Section 9.8.1, assuming that the second
derivative of $f(x)$ exists and is bounded and that the kernel satisfies $\int zK(z)dz = 0$,
as assumed in property (ii) of Section 9.3.3.

The kernel density estimator is biased with bias term $b(x_0)$ that depends on the
bandwidth, the curvature of the true density, and the kernel used according to

$$b(x_0) = E[\hat{f}(x_0)] - f(x_0) = \frac{1}{2}h^2 f''(x_0)\int z^2K(z)dz.$$  (9.4)

The kernel estimator is biased of size $O(h^2)$, where we use the order of magnitude
notation that a function $a(h)$ is $O(h^k)$ if $a(h)/h^k$ is finite. The bias disappears
asymptotically if $h \to 0$ as $N \to \infty$.

Assuming $h \to 0$ and $N \to \infty$, the variance of the kernel density estimator is

$$V[\hat{f}(x_0)] = \frac{1}{Nh}f(x_0)\int K(z)^2dz + o\left(\frac{1}{Nh}\right),$$  (9.5)

where a function $a(h)$ is $o(h^k)$ if $a(h)/h^k \to 0$. The variance depends on the sample
size, bandwidth, the true density, and the kernel. The variance disappears if $Nh \to \infty$,
which requires that while $h \to 0$ it must do so at a slower rate than $N \to \infty$.
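The opposing roles of $h$ in (9.4) and (9.5) can be checked numerically. The following is a minimal sketch, not part of the original treatment: it assumes a standard normal true density and the Epanechnikov kernel, and it evaluates only the leading terms (the $o(1/(Nh))$ remainder is dropped). The function names and the midpoint-rule integrator are illustrative choices.

```python
import math


def normal_pdf(x):
    """Standard normal density, used here as the assumed true density f."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_pdf_second_derivative(x):
    """f''(x) = (x^2 - 1) f(x) for the standard normal density."""
    return (x * x - 1.0) * normal_pdf(x)


def epanechnikov(z):
    return 0.75 * (1.0 - z * z) if abs(z) < 1.0 else 0.0


def integrate(g, lo, hi, steps=20000):
    """Midpoint-rule approximation to the integral of g over [lo, hi]."""
    width = (hi - lo) / steps
    return width * sum(g(lo + (j + 0.5) * width) for j in range(steps))


def leading_bias(x0, h, kernel):
    """Bias term (9.4): 0.5 * h^2 * f''(x0) * int z^2 K(z) dz."""
    mu2 = integrate(lambda z: z * z * kernel(z), -1.0, 1.0)
    return 0.5 * h * h * normal_pdf_second_derivative(x0) * mu2


def leading_variance(x0, n, h, kernel):
    """Leading term of (9.5): f(x0) * int K(z)^2 dz / (N h)."""
    r_k = integrate(lambda z: kernel(z) ** 2, -1.0, 1.0)
    return normal_pdf(x0) * r_k / (n * h)


# Doubling h quadruples the bias but halves the variance (N = 500, x0 = 0).
bias_small_h = leading_bias(0.0, 0.2, epanechnikov)
bias_large_h = leading_bias(0.0, 0.4, epanechnikov)
var_small_h = leading_variance(0.0, 500, 0.2, epanechnikov)
var_large_h = leading_variance(0.0, 500, 0.4, epanechnikov)
print(bias_small_h, bias_large_h, var_small_h, var_large_h)
```

The bias is negative at $x_0 = 0$ because the normal density is concave there, so smoothing pulls the estimated peak down.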


Consistency 


The kernel estimator is pointwise consistent, that is, consistent at a particular point
$x = x_0$, if both the bias and variance disappear. This is the case if $h \to 0$ and
$Nh \to \infty$.

For estimation of $f(x)$ at all values of $x$ the stronger condition of uniform convergence,
that is, $\sup_{x_0}|\hat{f}(x_0) - f(x_0)| \xrightarrow{p} 0$, can be shown to occur if $Nh/\ln N \to \infty$.
This requires $h$ larger than for pointwise convergence.


Asymptotic Normality 


The preceding results show that asymptotically $\hat{f}(x_0)$ has mean $f(x_0) + b(x_0)$ and
variance $(Nh)^{-1}f(x_0)\int K(z)^2dz$. It follows that if a central limit theorem can be
applied, the kernel density estimator has limit distribution

$$\sqrt{Nh}\,(\hat{f}(x_0) - f(x_0) - b(x_0)) \xrightarrow{d} N\left[0, f(x_0)\int K(z)^2dz\right].$$  (9.6)

The central limit theorem applied is a nonstandard one and requires condition (iv); see,
for example, Lee (1996, p. 139) or Pagan and Ullah (1999, p. 40).

It is important to note the presence of the bias term $b(x_0)$, defined in (9.4). For
typical choices of bandwidth this term does not disappear, complicating computation
of confidence intervals (presented in Section 9.3.7).


9.3.6. Bandwidth Choice 


The choice of bandwidth $h$ is much more important than choice of kernel function
$K(\cdot)$. There is a tension between setting $h$ small to reduce bias and setting $h$ large to
ensure smoothness. A natural metric to use is therefore mean-squared error (MSE),
the sum of bias squared and variance.

From (9.4) the bias is $O(h^2)$ and from (9.5) the variance is $O((Nh)^{-1})$. Intuitively
MSE is minimized by choosing $h$ so that bias squared and variance are of the
same order, so $h^4 = (Nh)^{-1}$, which implies the optimal bandwidth $h = O(N^{-0.2})$ and
$\sqrt{Nh} = O(N^{0.4})$. We now give a more formal treatment that includes a practical
plug-in estimate for $h$.


Mean Integrated Squared Error 


A local measure of the performance of the kernel density estimate at $x_0$ is the MSE

$$\mathrm{MSE}[\hat{f}(x_0)] = E[(\hat{f}(x_0) - f(x_0))^2],$$  (9.7)

where the expectation is with respect to the density $f(x)$. Since MSE equals variance
plus squared bias, (9.4) and (9.5) yield the MSE of the kernel density estimate

$$\mathrm{MSE}[\hat{f}(x_0)] = \frac{1}{Nh}f(x_0)\int K(z)^2dz + \left\{\frac{1}{2}h^2 f''(x_0)\int z^2K(z)dz\right\}^2.$$  (9.8)

To obtain a global measure of performance at all values of $x_0$ we begin by defining
the integrated squared error (ISE)

$$\mathrm{ISE}(h) = \int (\hat{f}(x_0) - f(x_0))^2 dx_0,$$  (9.9)

the continuous analogue of summing squared error over all $x_0$ in the discrete case.
This is written as a function of $h$ to emphasize dependence on the bandwidth. We then
eliminate the dependence of $\hat{f}(x_0)$ on $x$ values other than $x_0$ by taking the expected
value of the ISE with respect to the density $f(x)$. This yields the mean integrated
squared error (MISE),

$$\mathrm{MISE}(h) = E[\mathrm{ISE}(h)]$$
$$= E\left[\int (\hat{f}(x_0) - f(x_0))^2 dx_0\right]$$
$$= \int E[(\hat{f}(x_0) - f(x_0))^2]dx_0$$
$$= \int \mathrm{MSE}[\hat{f}(x_0)]dx_0,$$

where $\mathrm{MSE}[\hat{f}(x_0)]$ is defined in (9.8). From the preceding algebra MISE equals the
integrated mean-squared error (IMSE).


Optimal Bandwidth 


The optimal bandwidth minimizes MISE. Differentiating MISE($h$) with respect to $h$
and setting the derivative to zero yields the optimal bandwidth

$$h^* = \delta\left(\int f''(x_0)^2 dx_0\right)^{-0.2} N^{-0.2},$$  (9.10)

where $\delta$ depends on the kernel function used, with

$$\delta = \left(\frac{\int K(z)^2dz}{\left(\int z^2K(z)dz\right)^2}\right)^{0.2}.$$  (9.11)

This result is due to Silverman (1986).

Since $h^* = O(N^{-0.2})$, we have $h^* \to 0$ as $N \to \infty$ and $Nh^* = O(N^{0.8}) \to \infty$,
as required for consistency. The bias in $\hat{f}(x_0)$ is $O(h^{*2}) = O(N^{-0.4})$, which disappears
as $N \to \infty$. For a histogram estimate it can be shown that $h^* = O(N^{-1/3})$
and MISE($h^*$) $= O(N^{-2/3})$, inferior to MISE($h^*$) $= O(N^{-4/5})$ for the kernel density
estimate.

The optimal bandwidth depends on the curvature of the density, with $h^*$ lower if
$f(x)$ is highly variable.
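As a check on (9.11), the constant $\delta$ can be computed by numerical integration for any kernel. The sketch below is illustrative only: the kernel formulas are the standard textbook forms, the midpoint-rule integrator is a simple stand-in for a proper quadrature routine, and the cutoff of 8 for the Gaussian kernel is an arbitrary assumption.

```python
import math


def integrate(g, lo, hi, steps=20000):
    """Midpoint-rule approximation to the integral of g over [lo, hi]."""
    width = (hi - lo) / steps
    return width * sum(g(lo + (j + 0.5) * width) for j in range(steps))


def epanechnikov(z):
    return 0.75 * (1.0 - z * z) if abs(z) < 1.0 else 0.0


def triweight(z):
    return (35.0 / 32.0) * (1.0 - z * z) ** 3 if abs(z) < 1.0 else 0.0


def gaussian(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


# Each entry: (kernel function, half-width of the integration range).
KERNELS = {
    "epanechnikov": (epanechnikov, 1.0),
    "triweight": (triweight, 1.0),
    "gaussian": (gaussian, 8.0),  # tails beyond 8 are numerically negligible
}


def kernel_delta(kernel, half_width):
    """delta = (int K^2 / (int z^2 K)^2)^0.2, as in (9.11)."""
    r_k = integrate(lambda z: kernel(z) ** 2, -half_width, half_width)
    mu2 = integrate(lambda z: z * z * kernel(z), -half_width, half_width)
    return (r_k / mu2 ** 2) ** 0.2


deltas = {name: kernel_delta(k, w) for name, (k, w) in KERNELS.items()}
for name, value in deltas.items():
    print(f"{name:13s} delta = {value:.4f}")
```

The computed values can be compared with the $\delta$ column of Table 9.1.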


Optimal Kernel 


The optimal bandwidth varies with the kernel (see (9.10) and (9.11)). It can be shown 
that MISE(h*) varies little across kernels, provided different optimal h* are used for 
different kernels (Figure 9.4 provides an illustration). It can be shown that the optimal 
kernel is the Epanechnikov, though this advantage is slight. 

Bandwidth choice is much more important than kernel choice, and from (9.10) the
optimal bandwidth varies with the kernel.


Plug-in Bandwidth Estimate


A plug-in estimate for the bandwidth is a simple formula for h that depends on the 
sample size N and the sample standard deviation s. 

A useful starting point is to assume that the data are normally distributed. Then
$\int f''(x_0)^2dx_0 = 3/(8\sqrt{\pi}\sigma^5) = 0.2116/\sigma^5$, in which case (9.10) specializes to

$$h^* = 1.3643\,\delta N^{-0.2}s,$$  (9.12)

where $s$ is the sample standard deviation of $x$ and $\delta$ is given in Table 9.1 for several
kernels. For the Epanechnikov kernel $h^* = 2.345N^{-0.2}s$, and for the Gaussian kernel
$h^* = 1.059N^{-0.2}s$. The considerably lower bandwidth for the normal kernel arises
because, unlike most kernels, the normal kernel gives some weight to $x_i$ even if
$|x_i - x_0| > h$. In practice one uses Silverman’s plug-in estimate

$$h^* = 1.3643\,\delta N^{-0.2}\min(s, iqr/1.349),$$  (9.13)

where $iqr$ is the sample interquartile range. This uses $iqr/1.349$ as an alternative
estimate of $\sigma$ that protects against outliers, which can increase $s$ and lead to too large
an $h$.

These plug-in estimates for $h$ work well in practice, especially for symmetric unimodal
densities, even if $f(x)$ is not the normal density. Nonetheless, one should also
check by using variations such as twice and half the plug-in estimate.

For the example in Figures 9.2 and 9.4 we have $177^{-0.2} = 0.3551$, $s = 0.8282$, and
$iqr/1.349 = 0.6459$, so (9.13) yields $h^* = 0.3173\delta$. For the Epanechnikov kernel, for
example, this yields $h^* = 0.545$ since $\delta = 1.7188$ from Table 9.1.
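Silverman's plug-in rule (9.13) is simple to code. The following sketch assumes the $\delta$ values quoted from Table 9.1 and uses the Python standard library's default quartile definition, which may differ slightly from the one used by a given statistical package; the function name and the simulated sample are illustrative assumptions.

```python
import random
import statistics

# Kernel constants delta, as reported in Table 9.1.
DELTA = {"epanechnikov": 1.7188, "gaussian": 0.7764, "quartic": 2.0362}


def silverman_bandwidth(data, delta):
    """Plug-in bandwidth (9.13): 1.3643 * delta * N^(-0.2) * min(s, iqr/1.349)."""
    n = len(data)
    s = statistics.stdev(data)                   # sample standard deviation
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartile cut points
    iqr = q3 - q1
    return 1.3643 * delta * n ** -0.2 * min(s, iqr / 1.349)


# Illustrative data only: 177 standard normal draws with an arbitrary seed.
random.seed(1)
sample = [random.gauss(0.0, 1.0) for _ in range(177)]
for name, delta in DELTA.items():
    h = silverman_bandwidth(sample, delta)
    print(f"{name:13s} h* = {h:.3f}  (try also {h / 2:.3f} and {2 * h:.3f})")
```

Because $\delta$ enters multiplicatively, bandwidths for different kernels on the same data differ only by the ratio of their $\delta$ values.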


Cross-Validation


From (9.9), ISE($h$) $= \int \hat{f}^2(x_0)dx_0 - 2\int \hat{f}(x_0)f(x_0)dx_0 + \int f^2(x_0)dx_0$. The third
term does not depend on $h$. An alternative data-driven approach estimates the first
two terms in ISE($h$) by

$$CV(h) = \frac{1}{N^2h}\sum_{i=1}^N\sum_{j=1}^N K^{(2)}\left(\frac{x_i - x_j}{h}\right) - \frac{2}{N}\sum_{i=1}^N \hat{f}_{-i}(x_i),$$  (9.14)

where $K^{(2)}(u) = \int K(u - t)K(t)dt$ is the convolution of $K$ with itself, and $\hat{f}_{-i}(x_i)$ is
the leave-one-out kernel estimator of $f(x_i)$. See Lee (1996, p. 137) or Pagan and Ullah
(1999, p. 51) for a derivation. The cross-validation estimate $\hat{h}_{CV}$ is chosen to minimize
CV($h$). It can be shown that $\hat{h}_{CV} \xrightarrow{p} h^*$ as $N \to \infty$, but the rate of convergence
is very slow.
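To make (9.14) concrete, the sketch below evaluates CV($h$) over a small grid for the Gaussian kernel, for which the convolution $K^{(2)}$ is the $N[0, 2]$ density. The simulated sample, the seed, and the grid are illustrative assumptions; the double loop is written for clarity rather than speed.

```python
import math
import random


def gauss_kernel(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def gauss_kernel_conv(u):
    """K^(2)(u): the convolution of two standard normal densities is N[0, 2]."""
    return math.exp(-0.25 * u * u) / math.sqrt(4.0 * math.pi)


def cv_density(data, h):
    """Cross-validation criterion (9.14) for kernel density estimation."""
    n = len(data)
    conv_sum = 0.0  # double sum of K^(2)((x_i - x_j)/h), including i = j
    loo_sum = 0.0   # sum over i of the leave-one-out estimate at x_i
    for i in range(n):
        k_sum = 0.0
        for j in range(n):
            u = (data[i] - data[j]) / h
            conv_sum += gauss_kernel_conv(u)
            if j != i:
                k_sum += gauss_kernel(u)
        loo_sum += k_sum / ((n - 1) * h)
    return conv_sum / (n * n * h) - 2.0 * loo_sum / n


# Illustrative data only: 100 standard normal draws with an arbitrary seed.
random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(100)]
grid = [0.1 * g for g in range(1, 13)]  # h = 0.1, 0.2, ..., 1.2
cv_values = [cv_density(sample, h) for h in grid]
h_cv = grid[cv_values.index(min(cv_values))]
print("grid minimizer of CV(h):", round(h_cv, 1))
```

Since CV($h$) omits the third term of ISE($h$), its minimized value is typically negative; only the location of the minimum matters.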

Obtaining $\hat{h}_{CV}$ is computationally burdensome because ISE($h$) needs to be computed
for a range of values of $h$. It is often not necessary to cross-validate for kernel
density estimation as the plug-in estimate usually provides a good starting point.


9.3.7. Confidence Intervals


Kernel density estimates are usually presented without confidence intervals, but it is
possible to construct pointwise confidence intervals for $f(x_0)$, where pointwise means
evaluated at a particular value of $x_0$. A simple procedure is to obtain confidence intervals
at a small number of evaluation points $x_0$, say 10, that are evenly distributed over
the range of $x$ and plot these along with the estimated density curves.

The result (9.6) yields the following 95% confidence interval for $f(x_0)$:

$$f(x_0) \in \hat{f}(x_0) - b(x_0) \pm 1.96 \times \sqrt{\frac{1}{Nh}\hat{f}(x_0)\int K(z)^2dz}.$$

For most kernels $\int K(z)^2dz$ is easily obtained by analytical methods.

The situation is complicated by the bias term, which should not be ignored in finite
samples, even though asymptotically $b(x_0) \xrightarrow{p} 0$. This is because with optimal bandwidth
$h^* = O(N^{-0.2})$ the bias of the rescaled random variable $\sqrt{Nh}(\hat{f}(x_0) - f(x_0))$
given in (9.6) does not disappear, since $\sqrt{Nh^*}$ times $O(h^{*2})$ is $O(1)$. The bias can be
estimated using (9.4) and a kernel estimate of $f''(x_0)$, but in practice the estimate of
$f''(x_0)$ is noisy. Instead, the usual method is to reduce the bias in computing the confidence
interval, but not $\hat{f}(x_0)$ itself, by undersmoothing, that is, by choosing $h < h^*$ so
that $h = o(N^{-0.2})$. Other approaches include using a higher order kernel, such as the
fourth-order kernels given in Table 9.1, or bootstrapping (see Section 11.6.5).

One can also compute confidence bands for $f(x)$ over all possible values of $x$.
These are wider than the pointwise confidence intervals for each value $x_0$.
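A pointwise interval of this form can be sketched as follows. This is an illustration under stated assumptions, not a recommended procedure: it uses the Epanechnikov kernel, for which $\int K(z)^2dz = 0.6$, and it sets the bias term $b(x_0)$ to zero, which is only defensible when the bandwidth has been undersmoothed. The function names are hypothetical.

```python
import math

R_EPANECHNIKOV = 0.6  # integral of K(z)^2 for the Epanechnikov kernel


def epanechnikov(z):
    return 0.75 * (1.0 - z * z) if abs(z) < 1.0 else 0.0


def kde(x0, data, h):
    """Kernel density estimate at x0 with the Epanechnikov kernel."""
    return sum(epanechnikov((x - x0) / h) for x in data) / (len(data) * h)


def pointwise_ci(x0, data, h, crit=1.96):
    """Pointwise interval from (9.6), ignoring the bias term b(x0)."""
    f_hat = kde(x0, data, h)
    se = math.sqrt(f_hat * R_EPANECHNIKOV / (len(data) * h))
    return f_hat - crit * se, f_hat + crit * se


# Tiny artificial example: four observations at zero, bandwidth one.
lower, upper = pointwise_ci(0.0, [0.0, 0.0, 0.0, 0.0], 1.0)
print(lower, upper)
```

In practice the lower limit would be truncated at zero, since a density cannot be negative.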


9.3.8. Estimation of Derivatives of a Density 


In some cases estimates of the derivatives of a density need to be made. For example,
estimation of the bias term of $\hat{f}(x_0)$ given in (9.4) requires an estimate of $f''(x_0)$.

For simplicity we present estimates of the first derivative. A finite-difference
approach uses $\hat{f}'(x_0) = [\hat{f}(x_0 + \Delta) - \hat{f}(x_0 - \Delta)]/2\Delta$. A calculus approach instead
takes the first derivative of $\hat{f}(x_0)$ in (9.3), yielding
$\hat{f}'(x_0) = -(Nh^2)^{-1}\sum_i K'((x_i - x_0)/h)$.

Intuitively, a larger bandwidth should be used to estimate derivatives, which can be
more variable than $\hat{f}(x_0)$. The bias of $\hat{f}^{(s)}(x_0)$ is as before but the variance converges
more slowly, leading to optimal bandwidth $h^* = O(N^{-1/(2s+2p+1)})$ if $f(x_0)$ is $p$ times
differentiable. For kernel estimation of the first derivative we need $p \ge 3$.
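The two derivative estimates can be compared directly. The sketch below assumes the Gaussian kernel, for which $K'(z) = -zK(z)$, and an arbitrary small data set; the step size for the finite-difference version is likewise an illustrative choice.

```python
import math


def gaussian(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def gaussian_prime(z):
    """Derivative of the Gaussian kernel: K'(z) = -z K(z)."""
    return -z * gaussian(z)


def kde(x0, data, h):
    return sum(gaussian((x - x0) / h) for x in data) / (len(data) * h)


def kde_derivative(x0, data, h):
    """Calculus approach: -(N h^2)^(-1) * sum_i K'((x_i - x0)/h)."""
    return -sum(gaussian_prime((x - x0) / h) for x in data) / (len(data) * h * h)


def kde_derivative_fd(x0, data, h, step=1e-5):
    """Finite-difference approach with step size Delta = step."""
    return (kde(x0 + step, data, h) - kde(x0 - step, data, h)) / (2.0 * step)


data = [-1.2, -0.3, 0.1, 0.8, 1.9]  # arbitrary illustrative observations
analytic = kde_derivative(0.5, data, 0.6)
numeric = kde_derivative_fd(0.5, data, 0.6)
print(analytic, numeric)
```

For a fixed bandwidth the two approaches agree up to the finite-difference approximation error; they differ in practice mainly through the bandwidth chosen.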


9.3.9. Multivariate Kernel Density Estimate 


The preceding discussion considered kernel density estimation for scalar $x$. For the
density of the $k$-dimensional random variable $\mathbf{x}$, the multivariate kernel density
estimator is

$$\hat{f}(\mathbf{x}_0) = \frac{1}{Nh^k}\sum_{i=1}^N K\left(\frac{\mathbf{x}_i - \mathbf{x}_0}{h}\right),$$

where $K(\cdot)$ is now a $k$-dimensional kernel. Usually $K(\cdot)$ is a product kernel, the product
of one-dimensional kernels. Multivariate kernels such as the multivariate normal
density or spherical kernels proportional to $K(\mathbf{z}'\mathbf{z})$ can also be used. The kernel $K(\cdot)$
satisfies properties similar to properties given in the one-dimensional case; see Lee
(1996, p. 125).

The analytical results and expressions are similar to those before, except that the
variance of $\hat{f}(\mathbf{x}_0)$ declines at rate $O(Nh^k)$, which for $k > 1$ is slower than $O(Nh)$ in
the one-dimensional case. Then

$$\sqrt{Nh^k}\,(\hat{f}(\mathbf{x}_0) - f(\mathbf{x}_0) - b(\mathbf{x}_0)) \xrightarrow{d} N\left[0, f(\mathbf{x}_0)\int K(\mathbf{z})^2d\mathbf{z}\right].$$

The optimal bandwidth choice is $h = O(N^{-1/(k+4)})$, which is larger than $O(N^{-0.2})$ in
the one-dimensional case, and implies $\sqrt{Nh^k} = O(N^{2/(4+k)})$. The plug-in and cross-validation
methods can be extended to the multivariate case. For the product normal
kernel Scott’s plug-in estimate for the $j$th component of $\mathbf{x}$ is $h_j = N^{-1/(k+4)}s_j$, where
$s_j$ is the sample standard deviation of $x_j$.

Problems of sparseness of data are more likely to arise with a multivariate kernel.
There is a curse of dimensionality, as fewer observations in the vicinity of $\mathbf{x}_0$ receive
substantial weight when $\mathbf{x}$ is of higher dimension. Even when this is not a problem,
plotting even a bivariate kernel density estimate requires a three-dimensional plot that
can be difficult to read and interpret.

One use of a multivariate kernel density estimate is to permit estimation of a
conditional density. Since $f(y|x) = f(x, y)/f(x)$, an obvious estimator is
$\hat{f}(y|x) = \hat{f}(x, y)/\hat{f}(x)$, where $\hat{f}(x, y)$ and $\hat{f}(x)$ are bivariate and univariate
kernel density estimates.
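A conditional density estimate of this kind can be sketched with a product normal kernel. The code below is a hedged illustration: it uses component-specific bandwidths from Scott's rule, applies the same bandwidth for $x$ in the numerator and denominator (a design choice that makes the estimate integrate to one over $y$), and relies on simulated data with an arbitrary seed. All function names are hypothetical.

```python
import math
import random
import statistics


def gaussian(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def scott_bandwidths(columns):
    """Scott's plug-in rule h_j = N^(-1/(k+4)) s_j for each of the k columns."""
    k, n = len(columns), len(columns[0])
    return [n ** (-1.0 / (k + 4)) * statistics.stdev(col) for col in columns]


def product_kde(point, columns, bandwidths):
    """Product-kernel density estimate at `point`, one bandwidth per component."""
    n = len(columns[0])
    total = 0.0
    for i in range(n):
        term = 1.0
        for x0_j, col, h_j in zip(point, columns, bandwidths):
            term *= gaussian((col[i] - x0_j) / h_j) / h_j
        total += term
    return total / n


def conditional_density(y0, x0, x, y, h_x, h_y):
    """Estimate f(y0 | x0) as the ratio of bivariate to univariate estimates."""
    joint = product_kde((x0, y0), (x, y), (h_x, h_y))
    marginal = product_kde((x0,), (x,), (h_x,))
    return joint / marginal


# Illustrative data only: y = 1 + 0.5 x + noise, 40 draws, arbitrary seed.
rng = random.Random(7)
x_data = [rng.gauss(0.0, 1.0) for _ in range(40)]
y_data = [1.0 + 0.5 * xi + rng.gauss(0.0, 1.0) for xi in x_data]
h_x, h_y = scott_bandwidths([x_data, y_data])
print(conditional_density(1.0, 0.0, x_data, y_data, h_x, h_y))
```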


9.3.10. Higher Order Kernels 


The preceding analysis assumes $f(x)$ is twice differentiable, a necessary assumption to
obtain the bias term in (9.4). If $f(x)$ is more than twice differentiable then using higher
order kernels (see Section 9.3.3 for fourth-order examples) reduces the order of the
bias, leading to larger $h^*$ and faster rates of convergence. A general statement is that
if $\mathbf{x}$ is $k$ dimensional and $f(\mathbf{x})$ is $p$ times differentiable and a $p$th-order kernel is used,
then the kernel estimate $\hat{f}(\mathbf{x}_0)$ of $f(\mathbf{x}_0)$ has optimal rate of convergence $N^{-p/(2p+k)}$
when $h^* = O(N^{-1/(2p+k)})$.


9.3.11. Alternative Nonparametric Density Estimates 


The kernel density estimate is the standard nonparametric estimate. Other density
estimates are presented, for example, in Pagan and Ullah (1999). These often use
approaches such as nearest-neighbors methods that are more commonly used in
nonparametric regression and are presented briefly in Section 9.6.


9.4. Nonparametric Local Regression


We consider regression of scalar dependent variable $y$ on a scalar regressor variable $x$.
The regression model is

$$y_i = m(x_i) + \varepsilon_i, \quad i = 1, \dots, N,$$
$$\varepsilon_i \sim \text{iid}\,[0, \sigma_\varepsilon^2].$$  (9.15)


The complication is that the functional form m(-) is not specified, so NLS estimation 
is not possible. 

This section provides a simple general treatment of nonparametric regression us- 
ing local weighted averages. Specialization to kernel regression is given in Section 9.5 
and other commonly used local weighted methods are presented in Section 9.6. 


9.4.1. Local Weighted Averages 


Suppose that for a distinct value of the regressor, say $x_0$, there are multiple observations
on $y$, say $N_0$ observations. Then an obvious simple estimator for $m(x_0)$ is
the sample average of these $N_0$ values of $y$, which we denote $\hat{m}(x_0)$. It follows that
$\hat{m}(x_0) \sim [m(x_0), N_0^{-1}\sigma_\varepsilon^2]$, since it is the average of $N_0$ observations that by (9.15) are
iid with mean $m(x_0)$ and variance $\sigma_\varepsilon^2$.

The estimator $\hat{m}(x_0)$ is unbiased but not necessarily consistent. Consistency requires
$N_0 \to \infty$ as $N \to \infty$, so that $V[\hat{m}(x_0)] \to 0$. With discrete regressors this estimator
may be very noisy in finite samples because $N_0$ may be small. Even worse, for continuous
regressors there may be only one observation for which $x_i$ takes the particular
value $x_0$, even as $N \to \infty$.

The problem of sparseness in data can be overcome by averaging observed values
of $y$ when $x$ is close to $x_0$, in addition to when $x$ exactly equals $x_0$. We begin by noting
that the estimator $\hat{m}(x_0)$ can be expressed as a weighted average of the dependent
variable, with $\hat{m}(x_0) = \sum_i w_{i0}y_i$, where the weights $w_{i0}$ equal $1/N_0$ if $x_i = x_0$ and
equal 0 if $x_i \neq x_0$. Thus the weights vary with both the evaluation point $x_0$ and the
sample values of the regressors.

More generally we consider the local weighted average estimator

$$\hat{m}(x_0) = \sum_{i=1}^N w_{i0,h}\,y_i,$$  (9.16)

where the weights

$$w_{i0,h} = w(x_i, x_0, h)$$

sum to one, so $\sum_i w_{i0,h} = 1$. The weights are specified to increase as $x_i$ becomes
closer to $x_0$.

The additional parameter $h$ is generic notation for a window width parameter. It
is defined so that smaller values of $h$ lead to a smaller window and more weight being
placed on those observations with $x_i$ close to $x_0$. In the specific example of kernel
regression, $h$ is the bandwidth. Other methods given in Section 9.6 have alternative
smoothing parameters that play a similar role to $h$ here. As $h$ becomes smaller $\hat{m}(x_0)$
becomes less biased, as only observations close to $x_0$ are being used, but more variable,
as fewer observations are being used.

The OLS predictor for the linear regression model is a weighted average of $y_i$, since
some algebra yields

$$\hat{m}_{OLS}(x_0) = \sum_{i=1}^N \left\{\frac{1}{N} + \frac{(x_0 - \bar{x})(x_i - \bar{x})}{\sum_j (x_j - \bar{x})^2}\right\} y_i.$$

The OLS weights, however, can actually increase with increasing distance between $x_0$
and $x_i$ if, for example, $x_i > x_0 > \bar{x}$. Local regression instead uses weights that are
decreasing in $|x_i - x_0|$.
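The weighted-average representation of the OLS predictor can be verified on a small example. The data below are made up for illustration; the check is that the weights sum to one, reproduce the fitted OLS line, and increase in $x_i$ even beyond the evaluation point.

```python
def ols_weights(x, x0):
    """Weights w_i = 1/N + (x0 - xbar)(x_i - xbar) / sum_j (x_j - xbar)^2."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1.0 / n + (x0 - xbar) * (xi - xbar) / sxx for xi in x]


def ols_prediction(x, y, x0):
    """Prediction from the fitted line a + b * x0, computed the usual way."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum(
        (xi - xbar) ** 2 for xi in x
    )
    return ybar + slope * (x0 - xbar)


x = [1.0, 2.0, 3.0, 4.0, 5.0]      # made-up regressor values
y = [2.1, 3.9, 6.2, 7.8, 10.1]     # made-up dependent variable
weights = ols_weights(x, 4.0)
weighted_average = sum(w * yi for w, yi in zip(weights, y))
print(weights)
print(weighted_average, ols_prediction(x, y, 4.0))
```

With $x_0 = 4 > \bar{x} = 3$, the observation at $x_i = 5$ receives more weight than the observation at $x_i = 4$ itself, which is the feature that local regression avoids.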


9.4.2. K-Nearest Neighbors Example 


We consider a simple example, the unweighted average of the $y$ values corresponding
to the closest $(k - 1)/2$ observations on $x$ less than $x_0$ and the closest $(k - 1)/2$
observations on $x$ greater than $x_0$.

Order the observations by increasing $x$ values. Then evaluation at $x_0 = x_i$ yields

$$\hat{m}_k(x_i) = \frac{1}{k}\left(y_{i-(k-1)/2} + \cdots + y_{i+(k-1)/2}\right),$$

where for simplicity $k$ is odd, and potential modifications caused by ties and values of
$x_0$ close to the end points $x_1$ or $x_N$ are ignored. This estimator can be expressed as a
special case of (9.16) with weight

$$w_{i0,k} = \frac{1}{k} \times 1\left(|i - 0| \le \frac{k-1}{2}\right), \quad x_1 < x_2 < \cdots < x_0 < \cdots < x_N.$$

This estimator has many names. We refer to it as a (symmetrized) k-nearest neighbors
estimator (k-NN), defined in Section 9.6.1. It is also a standard local running
average or running mean or moving average of length $k$ centered at $x_0$ that is used,
for example, to plot a time series $y$ against time $x$. The parameter $k$ plays the role of
the window width $h$ in Section 9.4.1, with small $k$ corresponding to small $h$.

As an illustration, consider data generated from the model

$$y_i = 150 + 6.5x_i - 0.15x_i^2 + 0.001x_i^3 + \varepsilon_i, \quad i = 1, \dots, 100,$$  (9.17)
$$x_i = i,$$
$$\varepsilon_i \sim N[0, 25^2].$$

The mean of $y$ is a cubic in $x$, with $x$ taking values 1, 2, ..., 100, with turning points
at $x = 20$ and $x = 80$. To this is added a normally distributed error term with standard
deviation 25.
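The symmetrized k-NN estimator is easy to reproduce on data generated as in (9.17). The following sketch uses the coefficients exactly as printed in (9.17); the seed is arbitrary, so the simulated sample will not match the one used for the figures, and the truncation of the window at the end points is one simple convention among several.

```python
import random


def cubic_mean(x):
    """Conditional mean in (9.17), with coefficients as printed there."""
    return 150.0 + 6.5 * x - 0.15 * x ** 2 + 0.001 * x ** 3


def generate_sample(n=100, sigma=25.0, seed=12345):
    """Draw y_i = m(x_i) + e_i with x_i = i and e_i ~ N[0, sigma^2]."""
    rng = random.Random(seed)
    x = list(range(1, n + 1))
    y = [cubic_mean(xi) + rng.gauss(0.0, sigma) for xi in x]
    return x, y


def knn_running_mean(y, k):
    """Symmetrized k-NN (running mean) for y already ordered by x, k odd.

    Near the end points the window is truncated, so the estimate at the
    first observation is a one-sided average of (k + 1)/2 values.
    """
    half = (k - 1) // 2
    n = len(y)
    fitted = []
    for i in range(n):
        window = y[max(0, i - half):min(n, i + half + 1)]
        fitted.append(sum(window) / len(window))
    return fitted


x, y = generate_sample()
fit_k5 = knn_running_mean(y, 5)
fit_k25 = knn_running_mean(y, 25)
print(fit_k5[:3], fit_k25[:3])
```

Applying `knn_running_mean` to the noiseless values `cubic_mean(x)` isolates the bias of the moving average, including the one-sided averaging at the end points.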

Figure 9.5 plots the symmetrized k-NN estimator with $k = 5$ and 25. Both moving
averages suggest a cubic relationship. The second is smoother than the first but is still
quite jagged despite one-quarter of the sample being used to form the average. The
OLS regression line is also given on the diagram.

[Figure 9.5 plot: k-Nearest Neighbors Regression as k Varies.]

Figure 9.5: k-nearest neighbors regression curve for two different choices of k, as well as
OLS regression line. The data are generated from a cubic polynomial model.


The slope of $\hat{m}(x)$ is flatter at the end points when $k = 25$ rather than $k = 5$. This
illustrates a boundary problem in estimating $m(x)$ at the end points. For example,
for the smallest regressor value $x_1$ there are no lower valued observations on $x$
to be included, and the average becomes a one-sided average
$\hat{m}_k(x_1) = (y_1 + \cdots + y_{1+(k-1)/2})/[(k + 1)/2]$. Since for these data $m(x)$ is increasing in $x$ in this region,
this leads to $\hat{m}_k(x_1)$ being an overestimate and the overstatement is increasing in $k$.
Such boundary problems are reduced by instead using methods given in Section 9.6.2.


9.4.3. Lowess Regression Example 


Using alternative weights to those used to form the symmetrized k-NN estimator can 
lead to better estimates of m(x). 

An example is the Lowess estimator, defined in Section 9.6.2. This provides a 
smoother estimate of m(x) as it uses kernel weights rather than an indicator func- 
tion, analogous to a kernel density estimate being smoother than a running histogram. 
It also has smaller bias (see Section 9.6.2), which is especially beneficial in estimating 
m(x) at the end points. 

Figure 9.6 plots, for data generated by (9.17), the Lowess estimate with k = 25. This 
local regression estimate is quite close to the true cubic conditional mean function, 
which is also drawn. Comparing Figure 9.6 to Figure 9.5 for symmetrized k-NN with 
k = 25, we see that Lowess regression leads to a much smoother regression function 
estimate and more precise estimation at the boundaries. 


9.4.4. Statistical Inference 


When the error term is normally distributed and analysis is conditional on $x_1, \dots, x_N$,
the exact small-sample distribution of $\hat{m}(x_0)$ in (9.16) can be easily obtained.

[Figure 9.6 plot: Lowess Nonparametric Regression.]

Figure 9.6: Nonparametric regression curve using Lowess, as well as a cubic regression
curve. Same generated data as Figure 9.5.


Substituting $y_i = m(x_i) + \varepsilon_i$ into the definition of $\hat{m}(x_0)$ leads directly to

$$\hat{m}(x_0) - \sum_{i=1}^N w_{i0,h}\,m(x_i) = \sum_{i=1}^N w_{i0,h}\,\varepsilon_i,$$

which implies with fixed regressors, and if $\varepsilon_i$ are iid $N[0, \sigma_\varepsilon^2]$, that

$$\hat{m}(x_0) \sim N\left[\sum_{i=1}^N w_{i0,h}\,m(x_i),\ \sigma_\varepsilon^2\sum_{i=1}^N w_{i0,h}^2\right].$$  (9.18)

Note that in general $\hat{m}(x_0)$ is biased and the distribution is not necessarily centered
around $m(x_0)$.

With stochastic regressors and nonnormal errors, we condition on $x_1, \dots, x_N$ and
apply a central limit theorem for U-statistics that is appropriate for double summations
(see, for example, Pagan and Ullah, 1999, p. 359). Then for $\varepsilon_i$ iid $[0, \sigma_\varepsilon^2]$,

$$c(N)\sum_{i=1}^N w_{i0,h}\,\varepsilon_i \xrightarrow{d} N\left[0,\ \sigma_\varepsilon^2 \lim c(N)^2\sum_{i=1}^N w_{i0,h}^2\right],$$  (9.19)

where $c(N)$ is a function of the sample size with $O(c(N)) < N^{1/2}$ that can vary with
the local estimator. For example, $c(N) = \sqrt{Nh}$ for kernel regression and $c(N) = N^{0.4}$
for kernel regression with optimal bandwidth. Then

$$c(N)(\hat{m}(x_0) - m(x_0) - b(x_0)) \xrightarrow{d} N\left[0,\ \sigma_\varepsilon^2 \lim c(N)^2\sum_{i=1}^N w_{i0,h}^2\right],$$  (9.20)

where the bias is $b(x_0) = \sum_i w_{i0,h}\,m(x_i) - m(x_0)$, so that the left-hand side equals
$c(N)\sum_i w_{i0,h}\,\varepsilon_i$. Note that (9.20) yields (9.18) for the asymptotic distribution of $\hat{m}(x_0)$.

Clearly, the distribution of $\hat{m}(x_0)$, a simple weighted average, can be obtained under
alternative distributional assumptions. For example, for heteroskedastic errors
the variance in (9.19) and (9.20) is replaced by $\lim c(N)^2\sum_i \sigma_{\varepsilon,i}^2 w_{i0,h}^2$, which can be
consistently estimated by replacing $\sigma_{\varepsilon,i}^2$ by the squared residual $(y_i - \hat{m}(x_i))^2$.
Alternatively, one can bootstrap (see Section 11.6.5).


9.4.5. Bandwidth Choice 


Throughout this chapter we follow the nonparametric terminology that an estimator
$\hat{\theta}$ of $\theta_0$ has convergence rate $N^{-r}$ if $\hat{\theta} = \theta_0 + O_p(N^{-r})$, so that $N^r(\hat{\theta} - \theta_0) = O_p(1)$
and ideally $N^r(\hat{\theta} - \theta_0)$ has a limit normal distribution. Note in particular that an estimator
that is commonly called a $\sqrt{N}$-consistent estimator is converging at rate $N^{-1/2}$.
Nonparametric estimators typically have a slower rate of convergence than this, with
$r < 1/2$, because small bandwidth $h$ is needed to eliminate bias but then fewer than $N$
observations are being used to estimate $m(x_0)$.

As an example, consider the k-NN example of Section 9.4.2. Suppose $k = N^{4/5}$, so
that for example $k = 251$ if $N = 1,000$. Then the estimator is consistent as the moving
average uses $N^{4/5}/N = N^{-1/5}$ of the sample and is therefore collapsing around $x_0$ as
$N \to \infty$. Using (9.18), the variance of the moving average estimator is
$\sigma_\varepsilon^2\sum_i w_{i0,k}^2 = \sigma_\varepsilon^2 \times k \times (1/k)^2 = \sigma_\varepsilon^2 \times 1/k = \sigma_\varepsilon^2 N^{-4/5}$, so in (9.19) $c(N) = \sqrt{k} = \sqrt{N^{4/5}} = N^{0.4}$,
which is less than $N^{1/2}$. Other values of $k$ also ensure consistency, provided $k < O(N)$.
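The arithmetic in this example can be reproduced directly. The snippet below is a small worked calculation for $N = 1,000$ and $k = N^{4/5}$; the variable names are illustrative.

```python
import math

n = 1000
k = n ** 0.8                    # window length k = N^(4/5)
share_of_sample = k / n         # fraction of the sample in each average, N^(-1/5)
variance_factor = 1.0 / k       # variance is sigma^2 / k = sigma^2 * N^(-4/5)
c_n = math.sqrt(k)              # scaling c(N) = sqrt(k) = N^0.4

print(f"k = {k:.1f}")
print(f"share of sample used = {share_of_sample:.4f}")
print(f"c(N) = {c_n:.2f} versus sqrt(N) = {math.sqrt(n):.2f}")
```

The scaling factor $c(N)$ is roughly half of $\sqrt{N}$ at this sample size, which is the sense in which the nonparametric estimator converges more slowly.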

More generally, a range of values of the bandwidth parameter eliminates asymptotic 
bias, but smaller bandwidth increases variability. In this literature this trade-off is ac- 
counted for by minimizing mean-squared error, the sum of variance and bias squared. 

Stone (1980) showed that if $\mathbf{x}$ is $k$ dimensional and $m(\mathbf{x})$ is $p$ times differentiable
then the fastest possible rate of convergence for a nonparametric estimator of an $s$th-order
derivative of $m(\mathbf{x})$ is $N^{-r}$, where $r = (p - s)/(2p + k)$. This rate decreases as
the order of the derivative increases and as the dimension of $\mathbf{x}$ increases. It increases the
more differentiable $m(\mathbf{x})$ is assumed to be, approaching $N^{-1/2}$ if $m(\mathbf{x})$ has derivatives
of order approaching infinity. For scalar regression estimation of $m(x)$ it is customary
to assume existence of $m''(x)$, in which case $r = 2/5$ and the fastest convergence rate
is $N^{-0.4}$.
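Stone's exponent is simple enough to tabulate. The helper below is an illustrative sketch (the function name is an assumption) that returns $r = (p - s)/(2p + k)$ as an exact fraction, so the comparisons stated above can be checked for particular values.

```python
from fractions import Fraction


def stone_rate(p, s=0, k=1):
    """Exponent r in the optimal rate N^(-r): r = (p - s) / (2p + k).

    p: number of derivatives of m assumed to exist
    s: order of the derivative being estimated (0 for m itself)
    k: dimension of the regressor vector
    """
    if p <= s:
        raise ValueError("need p > s for a positive convergence rate")
    return Fraction(p - s, 2 * p + k)


print("scalar x, m twice differentiable:", stone_rate(2))          # 2/5
print("two regressors:", stone_rate(2, k=2))                       # 1/3
print("first derivative, p = 3:", stone_rate(3, s=1))              # 2/7
print("very smooth m:", float(stone_rate(1000)))                   # close to 1/2
```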


9.5. Kernel Regression 


Kernel regression is a weighted average estimator using kernel weights. Issues such as
bias and choice of bandwidth presented for kernel density estimation are also relevant
here. However, there is less guidance for choice of bandwidth than in the density
estimation case. Also, while we present kernel regression for pedagogical reasons, kernel local
regression estimators are often used in practice (see Section 9.6).


9.5.1. Kernel Regression Estimator 


The goal in kernel regression is to estimate the regression function $m(x)$ in the model
$y = m(x) + \varepsilon$ defined in (9.15).

From Section 9.4.1, an obvious estimator of $m(x_0)$ is the average of the sample
values $y_i$ of the dependent variable corresponding to the $x_i$s close to $x_0$. A variation
on this is to find the average of the $y_i$s for all observations with $x_i$ within distance $h$ of
$x_0$. This can be formally expressed as

$$\hat{m}(x_0) \equiv \frac{\sum_{i=1}^N 1\left(\left|\frac{x_i - x_0}{h}\right| < 1\right) y_i}{\sum_{i=1}^N 1\left(\left|\frac{x_i - x_0}{h}\right| < 1\right)},$$

where as before $1(A) = 1$ if event $A$ occurs and equals 0 otherwise. The numerator
sums the $y$ values and the denominator gives the number of $y$ values that are summed.

This expression gives equal weights to all observations close to $x_0$, but it may be
preferable to give the greatest weight at $x_0$ and decrease the weight as we move away.
Thus more generally we consider a kernel weighting function $K(\cdot)$, introduced in
Section 9.3.2. This yields the kernel regression estimator

$$\hat{m}(x_0) \equiv \frac{\frac{1}{Nh}\sum_{i=1}^N K\left(\frac{x_i - x_0}{h}\right) y_i}{\frac{1}{Nh}\sum_{i=1}^N K\left(\frac{x_i - x_0}{h}\right)}.$$  (9.21)

Several common kernel functions (uniform, Gaussian, Epanechnikov, and quartic)
have already been given in Table 9.1.

The constant $h$ is called the bandwidth, and $2h$ is called the window width. The
bandwidth plays the same role as $k$ in the k-NN example of Section 9.4.2.

The estimator (9.21) was proposed by Nadaraya (1964) and Watson (1964),
who gave an alternative derivation. The conditional mean $m(x) = \int y f(y|x)dy =
\int y[f(y, x)/f(x)]dy$, which can be estimated by $\hat{m}(x) = \int y[\hat{f}(y, x)/\hat{f}(x)]dy$, where
$\hat{f}(y, x)$ and $\hat{f}(x)$ are bivariate and univariate kernel density estimators. It can be shown
that this equals the estimator in (9.21). The statistics literature also considers kernel
regression in the fixed design or fixed regressors case where $f(x)$ is known and need not
be estimated, whereas we consider only the case of stochastic regressors that arises
with observational data.

The kernel regression estimator is a special case of the weighted average (9.16),
with weights

$$w_{i0,h} = \frac{\frac{1}{Nh}K\left(\frac{x_i - x_0}{h}\right)}{\frac{1}{Nh}\sum_{j=1}^N K\left(\frac{x_j - x_0}{h}\right)},$$  (9.22)

which by construction sum over $i$ to one. The general results of Section 9.4 are relevant,
but we give a more detailed analysis.
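The estimator (9.21) and its weights (9.22) translate directly into code. The following is a minimal sketch with hypothetical function names; it assumes the standard uniform and Epanechnikov kernel forms and raises an error when no observation falls inside the window, a case the formulas leave undefined.

```python
def uniform(z):
    return 0.5 if abs(z) < 1.0 else 0.0


def epanechnikov(z):
    return 0.75 * (1.0 - z * z) if abs(z) < 1.0 else 0.0


def kernel_weights(x0, x, h, kernel=epanechnikov):
    """Weights (9.22); the common factor 1/(N h) cancels in the ratio."""
    raw = [kernel((xi - x0) / h) for xi in x]
    total = sum(raw)
    if total == 0.0:
        raise ValueError("no observations receive positive weight at x0")
    return [r / total for r in raw]


def nadaraya_watson(x0, x, y, h, kernel=epanechnikov):
    """Kernel regression estimate (9.21) of m(x0)."""
    weights = kernel_weights(x0, x, h, kernel)
    return sum(w * yi for w, yi in zip(weights, y))


x = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up regressor values
y = [2.0, 4.0, 9.0, 1.0, 7.0]   # made-up dependent variable
# With the uniform kernel the estimate is the plain average of the y values
# whose x lies within h of x0: here (2 + 4 + 9) / 3.
print(nadaraya_watson(2.0, x, y, 1.5, kernel=uniform))
print(nadaraya_watson(2.0, x, y, 1.5, kernel=epanechnikov))
```

Switching from the uniform to the Epanechnikov kernel keeps the same window but downweights observations farther from $x_0$.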


9.5.2. Statistical Inference 


We present the distribution of the kernel regression estimator $\hat{m}(x)$ for given choice
of $K(\cdot)$ and $h$, assuming the data $x$ are iid. We implicitly assume that regressors are
continuous. With discrete regressors $\hat{m}(x_0)$ will still collapse on $m(x_0)$, and both $\hat{m}(x_0)$
in the limit and $m(x_0)$ are step functions.




Consistency 


Consistency of $\hat{m}(x_0)$ for the conditional mean function $m(x_0)$ requires $h \to 0$, so that
substantial weight is given only to $x_i$ very close to $x_0$. At the same time we need many
$x_i$ close to $x_0$, so that many observations are used in forming the weighted average.

Formally, $\hat{m}(x_0) \xrightarrow{p} m(x_0)$ if $h \to 0$ and $Nh \to \infty$ as $N \to \infty$.


Bias 


The kernel regression estimator is biased of size $O(h^2)$, with bias term

$$b(x_0) = h^2\left(m'(x_0)\frac{f'(x_0)}{f(x_0)} + \frac{1}{2}m''(x_0)\right)\int z^2K(z)dz$$  (9.23)

(see Section 9.8.2) assuming $m(x)$ is twice differentiable. As for kernel density estimation,
the bias varies with the kernel function used. More importantly, the bias depends
on the slope and curvature of the regression function $m(x_0)$ and the slope of the density
$f(x_0)$ of the regressors, whereas for density estimation the bias depended only on the
second derivative of $f(x_0)$. The bias can be particularly large at the end points, as
illustrated in Section 9.4.2.

The bias can be reduced by using higher order kernels, defined in Section 9.3.3, and
boundary modifications such as specific boundary kernels. Local polynomial regression
and modifications such as Lowess (see Section 9.6.2) have the attraction that the
term in (9.23) depending on $m'(x_0)$ drops out and that they perform well at the boundaries.


Asymptotic Normality 


In Section 9.8.2 it is shown that, for $x_i$ iid with density $f(x_i)$, the kernel regression
estimator has limit distribution

$$\sqrt{Nh}\,(\hat{m}(x_0) - m(x_0) - b(x_0)) \xrightarrow{d} N\left[0,\ \frac{\sigma_\varepsilon^2}{f(x_0)}\int K(z)^2dz\right].$$  (9.24)

The variance term in (9.24) is larger for small $f(x_0)$, so as expected the variance of
$\hat{m}(x_0)$ is larger in regions where $x$ is sparse.


9.5.3. Bandwidth Choice 


Incorporating values of $y_i$ for which $x_i \neq x_0$ into the weighted average introduces bias,
since $E[y_i|x_i] = m(x_i) \neq m(x_0)$ for $x_i \neq x_0$. However, using these additional points
reduces the variance of the estimator, since we are averaging over more data. The optimal
bandwidth balances the trade-off between increased bias and decreased variance,
using squared error loss. Unlike kernel density estimation, plug-in approaches are
impractical and cross-validation is used more extensively.

For simplicity most studies focus on choosing one bandwidth for all values of $x_0$.
Some methods with variable bandwidths, notably k-NN and Lowess, are given in
Section 9.6.


Mean Integrated Squared Error


The local performance of $\hat{m}(\cdot)$ at $x_0$ is measured by the mean-squared error, given
by

$$\mathrm{MSE}[\hat{m}(x_0)] = E[(\hat{m}(x_0) - m(x_0))^2],$$

where the expectation eliminates dependence of $\hat{m}(x_0)$ on $x$. Since MSE equals variance
plus squared bias, the MSE can be obtained using (9.23) and (9.24).

Similar to Section 9.3.6, the integrated square error is

$$\mathrm{ISE}(h) = \int (\hat{m}(x_0) - m(x_0))^2 f(x_0)dx_0,$$

where $f(x)$ denotes the density of the regressors $x$, and the mean integrated square
error, or equivalently the integrated mean-squared error, is

$$\mathrm{MISE}(h) = \int \mathrm{MSE}[\hat{m}(x_0)]f(x_0)dx_0.$$


Optimal Bandwidth 


The optimal bandwidth $h^*$ minimizes MISE($h$). This yields $h^* = O(N^{-0.2})$ since
the bias is $O(h^2)$ from (9.23); the variance is $O((Nh)^{-1})$ from (9.24), since an $O(1)$
variance is obtained after scaling $\hat{m}(x_0)$ by $\sqrt{Nh}$; and for bias squared and variance to
be of the same order $(h^2)^2 = (Nh)^{-1}$ or $h = N^{-0.2}$. The kernel estimate then converges
to $m(x_0)$ at rate $(Nh^*)^{-1/2} = N^{-0.4}$ rather than the usual $N^{-0.5}$ for parametric analysis.


Plug-in Bandwidth Estimate 


One can obtain an exact expression for $h^*$ that minimizes MISE($h$), using calculus
methods similar to those in Section 9.3.6 for the kernel density estimator. Then $h^*$
depends on the bias and variance expressions in (9.23) and (9.24).

A plug-in approach calculates $h^*$ using estimates of these unknowns. However,
estimation of $m'(x)$, for example, requires nonparametric methods that in turn require
an initial bandwidth choice, but $h^*$ also depends on unknowns such as $m''(x)$. Given
these complications one should be wary of plug-in estimates. More common is to use
cross-validation, presented in the following.

It can also be shown that MISE($h^*$) is minimized if the Epanechnikov kernel is
used (see Härdle, 1990, p. 186, or Härdle and Linton, 1994, p. 2321), though as in the
kernel density case MISE($h^*$) is not much larger for other kernels. The key issue is
determination of $h^*$, which will vary with kernel and the data.


Cross-Validation


An empirical estimate of the optimal $h$ can be obtained by the leave-one-out
cross-validation procedure. This chooses $\hat{h}^*$ that minimizes

$$CV(h) = \sum_{i=1}^N (y_i - \hat{m}_{-i}(x_i))^2\pi(x_i),$$  (9.25)

where $\pi(x_i)$ is a weighting function (discussed in the following) and

$$\hat{m}_{-i}(x_i) = \sum_{j \neq i} w_{ji,h}\,y_j \Big/ \sum_{j \neq i} w_{ji,h}$$  (9.26)

is a leave-one-out estimate of $m(x_i)$ obtained by the kernel formula (9.21), or more
generally by a weighted procedure (9.16), with the modification that $y_i$ is dropped.

Cross-validation is not as computationally intensive as it first appears. It can be
shown that

$$y_i - \hat{m}_{-i}(x_i) = \frac{y_i - \hat{m}(x_i)}{1 - [w_{ii,h}/\sum_j w_{ji,h}]},$$  (9.27)

so that for each value of $h$ cross-validation requires only one computation of the
weighted averages $\hat{m}(x_i)$, $i = 1, \dots, N$.
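The shortcut (9.27) can be confirmed numerically by computing CV($h$) both ways. The sketch below is illustrative: it assumes Gaussian kernel weights, sets $\pi(x_i) = 1$ for all observations, and uses simulated data from the cubic model (9.17) with an arbitrary seed and a reduced sample size.

```python
import math
import random


def gaussian(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def cv_direct(x, y, h):
    """CV(h) from (9.25)-(9.26): refit without observation i each time."""
    total = 0.0
    for i in range(len(x)):
        num = den = 0.0
        for j in range(len(x)):
            if j != i:
                w = gaussian((x[j] - x[i]) / h)
                num += w * y[j]
                den += w
        total += (y[i] - num / den) ** 2
    return total


def cv_shortcut(x, y, h):
    """CV(h) via (9.27): one full-sample fit per observation."""
    total = 0.0
    for i in range(len(x)):
        w = [gaussian((xj - x[i]) / h) for xj in x]
        w_sum = sum(w)
        fit = sum(wj * yj for wj, yj in zip(w, y)) / w_sum
        total += ((y[i] - fit) / (1.0 - w[i] / w_sum)) ** 2
    return total


# Illustrative data only: cubic mean from (9.17), 60 points, arbitrary seed.
rng = random.Random(2024)
x = [float(i) for i in range(1, 61)]
y = [150.0 + 6.5 * xi - 0.15 * xi ** 2 + 0.001 * xi ** 3 + rng.gauss(0.0, 25.0) for xi in x]

grid = [2.0, 4.0, 6.0, 8.0, 10.0, 15.0]
cv_values = [cv_shortcut(x, y, h) for h in grid]
h_cv = grid[cv_values.index(min(cv_values))]
print("grid minimizer of CV(h):", h_cv)
```

The direct and shortcut versions agree up to rounding error, so in practice only the shortcut needs to be evaluated over the bandwidth grid.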

The weights $\pi(x_i)$ are introduced to potentially downweight the end points, which
otherwise may receive too much importance since local weighted estimates can be
quite highly biased at the end points as illustrated in Section 9.4.2. For example,
observations with $x_i$ outside the 5th to 95th percentiles may not be used in calculating
CV(h), in which case $\pi(x_i) = 0$ for these observations and $\pi(x_i) = 1$ otherwise. The
term cross-validation is used as it validates the ability to predict the ith observation
using all the other observations in the data set. The ith observation is dropped because if
instead it was additionally used in the prediction, then CV(h) would be trivially
minimized when $\hat{m}_h(x_i) = y_i$, $i = 1, \ldots, N$. CV(h) is also called the estimated prediction
error.
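The shortcut in (9.27) can be checked numerically. The following is a minimal sketch rather than the procedure used for the book's examples: it assumes a Gaussian kernel, sets $\pi(x_i) = 1$ for every observation, and uses made-up data. All function names are illustrative.

```python
import math
import random

def gaussian_kernel(z):
    """Standard normal density, used here as the kernel K(z) (an assumption)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def cv_brute_force(x, y, h):
    """CV(h) from (9.25)-(9.26): refit without observation i, for every i."""
    total = 0.0
    for i in range(len(x)):
        num = den = 0.0
        for j in range(len(x)):
            if j != i:
                w = gaussian_kernel((x[j] - x[i]) / h)
                num += w * y[j]
                den += w
        total += (y[i] - num / den) ** 2
    return total

def cv_shortcut(x, y, h):
    """CV(h) via (9.27): one full-sample fit, rescaled by 1 - own weight share."""
    total = 0.0
    for i in range(len(x)):
        w = [gaussian_kernel((xj - x[i]) / h) for xj in x]
        w_sum = sum(w)
        m_hat = sum(wj * yj for wj, yj in zip(w, y)) / w_sum
        total += ((y[i] - m_hat) / (1.0 - w[i] / w_sum)) ** 2
    return total

# Illustrative data: a smooth curve plus noise (not the book's example).
rng = random.Random(0)
x = [0.1 * i for i in range(60)]
y = [math.sin(xi) + rng.gauss(0.0, 0.3) for xi in x]

# Evaluate CV(h) on a small grid and keep the minimizing bandwidth.
grid = [0.1, 0.2, 0.4, 0.8, 1.6]
cv_values = {h: cv_shortcut(x, y, h) for h in grid}
h_star = min(grid, key=cv_values.get)
```

The two functions return the same value up to rounding error, but the shortcut avoids a separate leave-one-out fit for each observation. A grid search is only one way to minimize CV(h); in practice one would still inspect the fitted curve.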

Härdle and Marron (1985) showed that minimizing CV(h) is asymptotically equivalent
to minimizing a modification of ISE(h) and MISE(h). The modification includes
weight function $\pi(x_0)$ in the integrand, as well as the averaged squared error (ASE)
$N^{-1} \sum_i (\hat{m}(x_i) - m(x_i))^2 \pi(x_i)$, which is a discrete sample approximation to ISE(h).
The measure CV(h) converges at the slow rate of $O(N^{-0.1})$ however, so CV(h) can be
quite variable in finite samples.


Generalized Cross-Validation


An alternative to leave-one-out cross-validation is to use a measure similar to CV(h)
but one that more simply uses $\hat{m}(x_i)$ rather than $\hat{m}_{-i}(x_i)$ and then adds a model
complexity penalty that increases as the bandwidth h decreases. This leads to

$$PV(h) = \sum_{i=1}^{N} \left(y_i - \hat{m}(x_i)\right)^2 \pi(x_i)\, p(w_{ii,h}),$$

where $p(\cdot)$ is the penalty function and $w_{ii,h}$ is the weight given to the ith observation
in $\hat{m}(x_i) = \sum_j w_{ji,h}\, y_j$.

A popular example is the generalized cross-validation measure that uses the
penalty function $p(w_{ii,h}) = (1 - w_{ii,h})^{-2}$. Other penalties are given in Härdle (1990,
p. 167) and Härdle and Linton (1994, p. 2323).
Cross-Validation Example


For the local running average example in Section 9.4.2, CV(k) = 54,811, 56,666,
63,456, 65,605, and 69,939 for k = 3, 5, 7, 9, and 25, respectively. In this case all
observations were used to calculate CV(k), with $\pi(x_i) = 1$, despite possible end-point
problems. There is no real gain after k = 5, though from Figure 9.5 this value
produced too rough an estimate and in practice one would choose a higher value of k to
get a smoother curve.

More generally cross-validation is by no means perfect and it is common to “eye- 
ball” fitted nonparametric curves to select h to achieve a desired degree of smoothness. 


Trimming 


The denominator of the kernel estimator in (9.21) is $\hat{f}(x_0)$, the kernel estimate of the
density of the regressor at $x_0$. At some evaluation points $\hat{f}(x_0)$ can be very small,
leading to a very large estimate $\hat{m}(x_0)$. Trimming eliminates or greatly downweights
all points with $\hat{f}(x_i) < b$, say, where $b \to 0$ at an appropriate rate as $N \to \infty$. Such
problems are most likely to occur in the tails of the distribution. For nonparametric
estimation one can just focus on estimation of $m(x_i)$ for more central values of $x_i$, and
values in the tails may be downweighted in cross-validation. However, the
semiparametric methods of Section 9.7 can entail computation of $\hat{m}(x_i)$ at all values of $x_i$, in
which case it is not unusual to trim. Ideally, the trimming function should make no
difference asymptotically, though it will make a difference in finite samples.


9.5.4. Confidence Intervals 


Kernel regression estimates should generally be presented with pointwise confidence
intervals. A simple procedure is to present pointwise confidence intervals for $m(x_0)$
evaluated at, for example, $x_0$ equal to the first through ninth deciles of x.

If the bias $b(x_0)$ in $\hat{m}(x_0)$ is ignored, (9.24) yields the following 95% confidence
interval:

$$m(x_0) \in \hat{m}(x_0) \pm 1.96 \sqrt{\frac{1}{Nh}\, \frac{\hat{\sigma}_\varepsilon^2}{\hat{f}(x_0)} \int K(z)^2\, dz},$$

where $\hat{\sigma}_\varepsilon^2 = \sum_i w_{i0,h}\, \hat{\varepsilon}_i^2$ and $w_{i0,h}$ is defined in (9.22) and $\hat{f}(x_0)$ is the kernel density
estimate at $x_0$. This estimate assumes homoskedastic errors, though is likely to be
somewhat robust to heteroskedasticity since observations close to $x_0$ are given the
greatest weight. Alternatively, from the discussion after (9.20) a heteroskedastic robust
95% confidence interval is $\hat{m}(x_0) \pm 1.96\, \hat{s}_0$, where $\hat{s}_0^2 = \sum_i w_{i0,h}^2\, \hat{\varepsilon}_i^2$.

As in the kernel density case, the bias in $\hat{m}(x_0)$ should not be ignored. As already
noted, estimation of the bias is difficult. Instead, the standard procedure is to
undersmooth, with smaller bandwidth h satisfying $h = o(N^{-0.2})$ rather than the optimal
$h^* = O(N^{-0.2})$.

Härdle (1990) gives a detailed presentation of confidence intervals, including uni-
form confidence bands rather than pointwise intervals, and the bootstrap methods given 
in Section 11.6.5. 
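The heteroskedastic robust interval $\hat{m}(x_0) \pm 1.96\, \hat{s}_0$ is simple to compute once the kernel weights are available. The sketch below is illustrative only: it assumes a Gaussian kernel, weights normalized to sum to one, residuals taken from the same kernel fit, and simulated data. It ignores the bias, so in practice the bandwidth would be chosen to undersmooth.

```python
import math
import random

def gaussian_kernel(z):
    """Standard normal density used as the kernel (an assumption)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def kernel_weights(x, x0, h):
    """Kernel weights w_{i0,h}, normalized so that they sum to one."""
    raw = [gaussian_kernel((xi - x0) / h) for xi in x]
    total = sum(raw)
    return [r / total for r in raw]

def kernel_fit(x, y, x0, h):
    """Local weighted average of y at x0."""
    return sum(w * yi for w, yi in zip(kernel_weights(x, x0, h), y))

def robust_interval(x, y, x0, h, z_crit=1.96):
    """Return (lower, fit, upper) with s0^2 = sum_i w_{i0,h}^2 * resid_i^2."""
    resid = [yi - kernel_fit(x, y, xi, h) for xi, yi in zip(x, y)]
    w0 = kernel_weights(x, x0, h)
    fit = sum(w * yi for w, yi in zip(w0, y))
    s0 = math.sqrt(sum((w * e) ** 2 for w, e in zip(w0, resid)))
    return fit - z_crit * s0, fit, fit + z_crit * s0

# Simulated data for illustration only.
rng = random.Random(1)
x = [0.05 * i for i in range(120)]
y = [math.sin(xi) + rng.gauss(0.0, 0.3) for xi in x]
lower, fit, upper = robust_interval(x, y, x0=3.0, h=0.3)
```

Repeating the call at, say, the deciles of x gives a set of pointwise intervals; these are not a uniform confidence band.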


9.5.5. Derivative Estimation 


In regression we are often interested in how the conditional mean of y changes with 
changes in x, the marginal effect, rather than the conditional mean per se. 

Kernel estimates can be easily used to form the derivative. The general result is that
the sth derivative of the kernel regression estimate, $\hat{m}^{(s)}(x_0)$, is consistent for $m^{(s)}(x_0)$,
the sth derivative of the conditional mean $m(x_0)$. Either calculus or finite-difference
approaches can be taken.

As an example, consider estimation of the first derivative in the generated-data
example of the previous section. Let $z_1, \ldots, z_N$ denote the ordered points at which
the kernel regression function is evaluated and $\hat{m}(z_1), \ldots, \hat{m}(z_N)$ denote the estimates
at these points. A finite-difference estimate is $\hat{m}'(z_i) = [\hat{m}(z_i) - \hat{m}(z_{i-1})]/[z_i - z_{i-1}]$.
This is plotted in Figure 9.7, along with the true derivative, which for the dgp given
in (9.17) is the quadratic $m'(z_i) = 6.5 - 0.30 z_i + 0.003 z_i^2$. As expected the derivative
estimate is somewhat noisy, but it picks up the essentials. Derivative estimates should
be based on oversmoothed estimates of the conditional mean. For further details see
Pagan and Ullah (1999, chapter 4). Härdle (1990, p. 160) presents adaptation of
cross-validation to derivative estimation.
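The finite-difference approach can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of Figure 9.7: it uses a Gaussian kernel regression rather than Lowess, noise-free data so that only smoothing and differencing error remain, and a cubic mean whose intercept (150) is assumed since it does not affect the derivative.

```python
import math

def gaussian_kernel(z):
    """Standard normal density used as the kernel (an assumption)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def kernel_fit(x, y, x0, h):
    """Kernel regression estimate of m(x0)."""
    w = [gaussian_kernel((xi - x0) / h) for xi in x]
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

def finite_difference(z, m):
    """Backward differences [m(z_i) - m(z_{i-1})] / [z_i - z_{i-1}], i = 2..n."""
    return [(m[i] - m[i - 1]) / (z[i] - z[i - 1]) for i in range(1, len(z))]

def true_mean(x):
    # Cubic whose derivative is 6.5 - 0.30 x + 0.003 x^2; intercept is assumed.
    return 150.0 + 6.5 * x - 0.15 * x ** 2 + 0.001 * x ** 3

def true_derivative(x):
    return 6.5 - 0.30 * x + 0.003 * x ** 2

# Noise-free illustration on an evenly spaced regressor.
x = [float(i) for i in range(101)]
y = [true_mean(xi) for xi in x]
z = [float(v) for v in range(10, 92, 2)]        # evaluation points 10, 12, ..., 90
m_hat = [kernel_fit(x, y, zi, h=3.0) for zi in z]
d_hat = finite_difference(z, m_hat)             # d_hat[i - 1] estimates m'(z[i])
```

With noisy data the differences amplify the noise in $\hat{m}$, which is why a larger bandwidth than the one chosen for the level of $m$ is usually preferred.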

In addition to the local derivative $m'(x_0)$ we may also be interested in the average
derivative $E[m'(x)]$. The average derivative estimator given in Section 9.7.4 provides
a $\sqrt{N}$-consistent and asymptotically normal estimate of $E[m'(x)]$.


9.5.6. Conditional Moment Estimation 


The kernel regression methods for the conditional mean E[y|x] = m(x) can be ex- 
tended to nonparametric estimation of other conditional moments. 


[Figure 9.7 plot not reproduced. Panel title: Nonparametric Derivative Estimation. Series: from Lowess (k=25); from OLS cubic regression. Vertical axis: dependent variable y. Horizontal axis: regressor x, 0 to 100.]

Figure 9.7: Nonparametric derivative estimate using previously estimated Lowess
regression curve, as well as using a cubic regression curve. Same generated data as
Figure 9.5.
For raw conditional moments such as $E[y^k|x]$ we use the weighted average

$$\hat{E}[y^k|x_0] = \sum_i w_{i0,h}\, y_i^k, \tag{9.28}$$

where the weights $w_{i0,h}$ may be the same weights as used for estimation of $m(x_0)$.

Central conditional moments can then be computed by reexpressing them as
weighted sums of raw moments. For example, since $V[y|x] = E[y^2|x] - (E[y|x])^2$, the
conditional variance can be estimated by $\hat{E}[y^2|x_0] - \hat{m}(x_0)^2$. One expects that higher
order conditional moments will be estimated with more noise than will be the
conditional mean.
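A short sketch of the conditional variance calculation follows. It assumes Gaussian kernel weights normalized to sum to one and uses simulated data in which the error standard deviation rises with x; the function names are illustrative.

```python
import math
import random

def gaussian_kernel(z):
    """Standard normal density used as the kernel (an assumption)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def conditional_raw_moment(x, y, x0, h, k):
    """Kernel-weighted average of y_i^k centred at x0, as in (9.28)."""
    w = [gaussian_kernel((xi - x0) / h) for xi in x]
    return sum(wi * yi ** k for wi, yi in zip(w, y)) / sum(w)

def conditional_variance(x, y, x0, h):
    """Second raw moment minus the squared conditional mean estimate."""
    m1 = conditional_raw_moment(x, y, x0, h, 1)
    m2 = conditional_raw_moment(x, y, x0, h, 2)
    return m2 - m1 ** 2

# Simulated data: the error standard deviation is 0.5 + 0.2 x.
rng = random.Random(2)
x = [0.02 * i for i in range(500)]
y = [xi + rng.gauss(0.0, 0.5 + 0.2 * xi) for xi in x]
var_low = conditional_variance(x, y, x0=2.0, h=0.5)
var_high = conditional_variance(x, y, x0=8.0, h=0.5)
```

Because the same window is used for both moments, variation of the conditional mean within the window also enters the estimate, so a smaller bandwidth or a local linear fit for the mean may be preferable when the mean is steep.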


9.5.7. Multivariate Kernel Regression 


We have focused on kernel regression on a single regressor. For regression of scalar y
on k-dimensional vector $\mathbf{x}$, that is, $y_i = m(\mathbf{x}_i) + \varepsilon_i = m(x_{1i}, \ldots, x_{ki}) + \varepsilon_i$, the kernel
estimator of $m(\mathbf{x}_0)$ becomes

$$\hat{m}(\mathbf{x}_0) = \frac{\frac{1}{Nh^k} \sum_{i=1}^{N} K\!\left(\frac{\mathbf{x}_i - \mathbf{x}_0}{h}\right) y_i}{\frac{1}{Nh^k} \sum_{i=1}^{N} K\!\left(\frac{\mathbf{x}_i - \mathbf{x}_0}{h}\right)},$$

where $K(\cdot)$ is now a multivariate kernel. Often $K(\cdot)$ is the product of k
one-dimensional kernels, though multivariate kernels such as the multivariate normal
density can be used.

If a product kernel is used the regressors should be transformed to a common scale
by dividing by the standard deviation. Then the cross-validation measure (9.25) can
be used to determine a common optimal bandwidth $h^*$, though determining which $\mathbf{x}_i$
should be downweighted as the result of closeness to the end points is more
complicated when $\mathbf{x}$ is multivariate. Alternatively, regressors need not be rescaled, but then
different bandwidths should be used for each regressor.
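A product-kernel estimator can be sketched in a few lines. The example assumes a Gaussian kernel for each regressor and implements the rescaling by giving regressor j the bandwidth $h \cdot sd_j$, which is equivalent to dividing each regressor by its standard deviation and using a common h. The data and names are hypothetical.

```python
import math

def gaussian_kernel(z):
    """Standard normal density used as each one-dimensional kernel."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def sample_sd(values):
    """Sample standard deviation with the N - 1 divisor."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))

def product_kernel_fit(X, y, x0, bandwidths):
    """Kernel regression at x0 with K(u) = prod_j K_1(u_j), one bandwidth per regressor."""
    num = den = 0.0
    for xi, yi in zip(X, y):
        w = 1.0
        for xij, x0j, hj in zip(xi, x0, bandwidths):
            w *= gaussian_kernel((xij - x0j) / hj)
        num += w * yi
        den += w
    return num / den

# Two regressors on very different scales, laid out on a grid (illustrative).
X = [(0.1 * a, 10.0 * b) for a in range(21) for b in range(21)]
y = [x1 + 0.01 * x2 for x1, x2 in X]

h = 0.5                                             # common bandwidth on the standardized scale
sds = [sample_sd([row[j] for row in X]) for j in range(2)]
bandwidths = [h * s for s in sds]
fit = product_kernel_fit(X, y, x0=(1.0, 100.0), bandwidths=bandwidths)
```

Without the rescaling, a single h would smooth almost entirely over one regressor and hardly at all over the other.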

The asymptotic results and expressions are similar to those considered before, as the
estimate is again a local average of the $y_i$. The bias $b(\mathbf{x}_0)$ is again $O(h^2)$ as before, but
the variance of $\hat{m}(\mathbf{x}_0)$ is $O((Nh^k)^{-1})$, which declines more slowly than in the one-dimensional
case since essentially a smaller fraction of the sample is being used to form $\hat{m}(\mathbf{x}_0)$.
Then

$$\sqrt{Nh^k}\left(\hat{m}(\mathbf{x}_0) - m(\mathbf{x}_0) - b(\mathbf{x}_0)\right) \xrightarrow{d} \mathcal{N}\!\left[0,\; \frac{\sigma_\varepsilon^2}{f(\mathbf{x}_0)} \int K(\mathbf{z})^2\, d\mathbf{z}\right].$$

The optimal bandwidth choice is $h^* = O(N^{-1/(k+4)})$, which is larger than $O(N^{-0.2})$ in
the one-dimensional case. The corresponding optimal rate of convergence of $\hat{m}(\mathbf{x}_0)$ is
$N^{-2/(k+4)}$.
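The bandwidth and convergence rates just stated follow from balancing squared bias against variance. The following is a heuristic sketch in which $c_1$ and $c_2$ are generic positive constants introduced here for illustration; they stand for the unknown bias and variance terms.

```latex
\begin{aligned}
\mathrm{MSE}(h) &\approx c_1 h^{4} + \frac{c_2}{N h^{k}}, \qquad c_1, c_2 > 0, \\
\frac{\partial\, \mathrm{MSE}(h)}{\partial h} = 0
  &\;\Longrightarrow\; 4 c_1 h^{3} = \frac{k\, c_2}{N h^{k+1}}
  \;\Longrightarrow\; h^{*} = \left( \frac{k\, c_2}{4 c_1 N} \right)^{1/(k+4)} = O\!\left(N^{-1/(k+4)}\right), \\
\mathrm{MSE}(h^{*}) &= O\!\left(N^{-4/(k+4)}\right)
  \;\Longrightarrow\; \hat{m}(\mathbf{x}_0) - m(\mathbf{x}_0) = O_p\!\left(N^{-2/(k+4)}\right).
\end{aligned}
```

Setting k = 1 recovers $h^* = O(N^{-0.2})$ and the rate $N^{-0.4}$ of the scalar case.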

This result and the earlier scalar result assume that m(x) is twice differentiable, a
necessary assumption to obtain the bias term in (9.23). If m(x) is instead p times
differentiable then kernel estimation using a pth order kernel (see Section 9.3.3) reduces
the order of the bias to $O(h^p)$, so that for p > 2 the optimal bandwidth $h^* = O(N^{-1/(2p+k)})$
shrinks more slowly and the rate of convergence is faster, attaining
Stone's bound given in Section 9.4.5; see Härdle (1990, p. 93) for further details. Other
nonparametric estimators given in the next section can also attain Stone's bound.
The convergence rate decreases as the number of regressors increases, approaching
$N^{0}$ as the number of regressors approaches infinity. This curse of dimensionality
greatly restricts the use of nonparametric methods in regression models with several
regressors. Semiparametric models (see Section 9.7) place additional structure so that
the nonparametric components are of low dimension.
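A rough sense of the curse of dimensionality comes from asking how large a sample is needed with k regressors to match the rate-implied precision of a given one-regressor sample. The calculation below is a back-of-the-envelope sketch: it uses only the rate $N^{-2/(k+4)}$ and ignores all constants, so the numbers are indicative at best.

```python
def rate_exponent(k):
    """Exponent r in the optimal kernel regression rate N^(-r), with r = 2 / (k + 4)."""
    return 2.0 / (k + 4)

def equivalent_sample_size(n_one_dim, k):
    """N_k solving N_k^(-2/(k+4)) = n_one_dim^(-2/5), ignoring all constants."""
    return n_one_dim ** (rate_exponent(1) / rate_exponent(k))

# Exponent and equivalent sample size for a one-regressor sample of 1,000.
table = {k: (rate_exponent(k), equivalent_sample_size(1000, k)) for k in (1, 2, 5, 10)}
```

On this crude measure, matching N = 1,000 with one regressor takes roughly 4,000 observations with two regressors, about 250,000 with five, and about 250 million with ten.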


9.5.8. Tests of Parametric Models 


An obvious test of correct specification of a parametric model of the conditional mean 
is to compare the fitted mean with that obtained from a nonparametric model. 

Let $\hat{m}_\theta(x)$ denote a parametric estimator of $E[y|x]$ and $\hat{m}_h(x)$ denote a
nonparametric estimator such as a kernel estimator. One approach is to compare $\hat{m}_\theta(x)$ with $\hat{m}_h(x)$
at a range of values of x. This is complicated by the need to correct for asymptotic
bias in $\hat{m}_h(x)$ (see Härdle and Mammen, 1993). A second approach is to consider
conditional moment tests of the form $N^{-1} \sum_i w_i (y_i - \hat{m}_\theta(x_i))$, where different weights,
based in part on kernel regression, test failure of $E[y|x] = m_\theta(x)$ in different
directions. For example, Horowitz and Härdle (1994) use $w_i = \hat{m}_h(x_i) - \hat{m}_\theta(x_i)$. Pagan
and Ullah (1999, pp. 141-150) and Yatchew (2003, pp. 119-124) survey some of the
methods used.


9.6. Alternative Nonparametric Regression Estimators 


Section 9.4 introduced local regression methods that estimate the regression function
$m(x_0)$ by a local weighted average $\hat{m}(x_0) = \sum_i w_{i0,h}\, y_i$, where the weights $w_{i0,h} =
w(x_i, x_0, h)$ differ with the point of evaluation $x_0$ and the sample value of $x_i$. Section
9.5 presented detailed results when the weights are kernel weights.

Here we consider other commonly used local estimators that correspond to other 
weights. Many of the results of Section 9.5 carry through, with similar optimal rates 
of convergence and use of cross-validation for bandwidth selection, though the exact 
expressions for bias and variance differ from those in (9.23) and (9.24). The estimators 
given in Section 9.6.2 are especially popular. 


9.6.1. Nearest Neighbors Estimator 


The k-nearest neighbor estimator is the equally weighted average of the y values for
the k observations of $x_i$ closest to $x_0$. Define $N_k(x_0)$ to be the set of k observations of
$x_i$ closest to $x_0$. Then

$$\hat{m}_{k\text{-}NN}(x_0) = \frac{1}{k} \sum_{i=1}^{N} \mathbf{1}(x_i \in N_k(x_0))\, y_i. \tag{9.29}$$

This estimator is a kernel estimator with uniform weights (see Table 9.1) except that
the bandwidth is variable. Here the bandwidth $h_0$ at $x_0$ equals the distance between
$x_0$ and the furthest of the k nearest neighbors, and more formally $h_0 \simeq k/(2N f(x_0))$.
The quantity k/N is called the span. Smoother curves can be obtained by using kernel
weights in (9.29).

The estimator has the attraction of providing a simple rule for variable bandwidth 
selection. It is computationally faster to use a symmetrized version that uses the k/2 
nearest neighbors to the left and a similar number to the right, which is the local run- 
ning average method used in Section 9.4.2. Then one can use an updating formula on 
observations ordered by increasing x;, as then one observation leaves the data and one 
enters as xo increases. 
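Both versions can be sketched briefly. The first function implements (9.29) directly. The second is one possible symmetrized variant with the updating formula; it assumes data already sorted by x and an odd window of k points centred on each observation, and it returns estimates only where a full window is available. These conventions are assumptions made for the illustration.

```python
def knn_estimate(x, y, x0, k):
    """Equally weighted average of y over the k observations with x closest to x0."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i] - x0))
    return sum(y[i] for i in order[:k]) / k

def running_average(y_sorted, k):
    """Symmetrized k-NN for y ordered by increasing x, with odd k = 2 * half + 1.

    Uses the updating formula: each time the window moves one step to the
    right, one observation enters the sum and one leaves it.
    """
    half = k // 2
    window_sum = sum(y_sorted[:k])
    out = [window_sum / k]                      # estimate at position `half`
    for centre in range(half + 1, len(y_sorted) - half):
        window_sum += y_sorted[centre + half] - y_sorted[centre - half - 1]
        out.append(window_sum / k)
    return out

nearest = knn_estimate([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], x0=3, k=3)
smoothed = running_average([1, 2, 3, 4, 5, 6], k=3)
```

The updating step costs a fixed number of operations per evaluation point, whereas the direct version sorts the data for every $x_0$.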


9.6.2. Local Linear Regression and Lowess 


The kernel regression estimator is a local constant estimator because it assumes that
m(x) equals a constant in the local neighborhood of $x_0$. Instead, one can let m(x) be
linear in the neighborhood of $x_0$, so that $m(x) = a_0 + b_0(x - x_0)$ in the neighborhood
of $x_0$.

To implement this idea, note that the kernel regression estimator $\hat{m}(x_0)$ can be
obtained by minimizing $\sum_i K((x_i - x_0)/h)(y_i - m_0)^2$ with respect to $m_0$. The local
linear regression estimator minimizes

$$\sum_{i=1}^{N} K\!\left(\frac{x_i - x_0}{h}\right) \left(y_i - a_0 - b_0(x_i - x_0)\right)^2, \tag{9.30}$$

with respect to $a_0$ and $b_0$, where $K(\cdot)$ is a kernel weighting function. Then $\hat{m}(x) =
\hat{a}_0 + \hat{b}_0(x - x_0)$ in the neighborhood of $x_0$. The estimate at exactly $x_0$ is then $\hat{m}(x_0) =
\hat{a}_0$, and $\hat{b}_0$ provides an estimate of the first derivative $m'(x_0)$. More generally, a local
polynomial estimator of degree p minimizes

$$\sum_{i=1}^{N} K\!\left(\frac{x_i - x_0}{h}\right) \left(y_i - a_{0,0} - a_{0,1}(x_i - x_0) - \cdots - a_{0,p} \frac{(x_i - x_0)^p}{p!}\right)^2, \tag{9.31}$$

yielding $\hat{m}^{(s)}(x_0) = \hat{a}_{0,s}$.

Fan and Gijbels (1996) list many properties and attractions of this method.
Estimation entails only weighted least-squares regression at each evaluation point $x_0$. The
estimators can be expressed as a weighted average of $y_i$, since they are LS estimators.
The local linear estimator has bias term $b(x_0) = h^2 \left(\tfrac{1}{2} m''(x_0)\right) \int z^2 K(z)\, dz$, which,
unlike the bias for kernel regression given in (9.23), does not depend on $m'(x_0)$. This
is especially beneficial for overcoming the boundary problems illustrated in Section
9.4.2. For estimating an sth-order derivative a good choice of p is p = s + 1 so that,
for example, one uses a local quadratic estimator to estimate the first derivative.
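The boundary advantage of the local linear fit is easy to see on exactly linear data. The sketch below solves the weighted least-squares problem (9.30) in closed form from its two normal equations; the Gaussian kernel, the data, and the function names are assumptions made for the illustration.

```python
import math

def gaussian_kernel(z):
    """Standard normal density used as the kernel (an assumption)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def local_constant(x, y, x0, h):
    """Kernel (local constant) regression estimate at x0."""
    w = [gaussian_kernel((xi - x0) / h) for xi in x]
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)

def local_linear(x, y, x0, h):
    """Minimize (9.30) in closed form; returns (a0_hat, b0_hat)."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for xi, yi in zip(x, y):
        w = gaussian_kernel((xi - x0) / h)
        d = xi - x0
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * yi
        t1 += w * d * yi
    # Normal equations: t0 = a0*s0 + b0*s1 and t1 = a0*s1 + b0*s2.
    b0 = (s0 * t1 - s1 * t0) / (s0 * s2 - s1 * s1)
    a0 = (t0 - b0 * s1) / s0
    return a0, b0

x = [0.02 * i for i in range(51)]          # x in [0, 1]
y = [2.0 + 3.0 * xi for xi in x]           # exactly linear, no noise
a0, b0 = local_linear(x, y, x0=0.0, h=0.2)     # left end point
nw = local_constant(x, y, x0=0.0, h=0.2)
```

At the left end point the local linear fit recovers the intercept 2 and slope 3, while the local constant fit is pulled upward because all neighbors lie to the right of $x_0$.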

A standard local regression estimator is the locally weighted scatterplot smoothing
or Lowess estimator of Cleveland (1979). This is a variant of local polynomial
estimation that in (9.31) uses a variable bandwidth $h_{0,k}$ determined by the distance from $x_0$ to
its kth nearest neighbor; uses the tricubic kernel $K(z) = (70/81)(1 - |z|^3)^3\, \mathbf{1}(|z| < 1)$;
and downweights observations with large residuals $y_i - \hat{m}(x_i)$, which requires passing
through the data N times. For a summary see Fan and Gijbels (1996, p. 24). Lowess
is attractive compared to kernel regression as it uses a variable bandwidth, robustifies
against outliers, and uses a local polynomial estimator to minimize boundary
problems. However, it is computationally intensive.

Another popular variation is the supersmoother of Friedman (1984) (see Härdle,
1990, p. 181). The starting point is symmetrized k-NN, using local linear fit rather than
local constant fit for better fit at the boundary. Rather than use a fixed span or fixed 
k, however, the supersmoother is a variable span smoother where the variable span is 
determined by local cross-validation that entails nine passes over the data. Compared 
to Lowess the supersmoother does not robustify against outliers, but it permits the span 
to vary and is fast to compute. 


9.6.3. Smoothing Spline Estimator 


The cubic smoothing spline estimator $\hat{m}_\lambda(x)$ minimizes the penalized residual sum
of squares

$$PRSS(\lambda) = \sum_{i=1}^{N} \left(y_i - m(x_i)\right)^2 + \lambda \int \left(m''(x)\right)^2 dx, \tag{9.32}$$

where $\lambda$ is a smoothing parameter. As elsewhere in this chapter squared error loss is
used. The first term alone leads to a very rough fit since then $\hat{m}(x_i) = y_i$. The second
term is introduced to penalize roughness. The cross-validation methods of Section
9.5.3 can be used to determine $\lambda$, with larger values of $\lambda$ leading to a smoother curve.

Härdle (1990, pp. 56-65) shows that $\hat{m}_\lambda(x)$ is a cubic polynomial between successive
x-values and that the estimator can be expressed as a local weighted average of
the $y_i$ and is asymptotically equivalent to a kernel estimator with a particular variable
kernel. In microeconometrics smoothing splines are used less frequently than the other 
methods presented here. The approach can be adapted to other roughness penalties and 
other loss functions. 


9.6.4. Series Estimators 


Series estimators approximate a regression function by a weighted sum of K functions
$z_1(x), \ldots, z_K(x)$,

$$\hat{m}_K(x) = \sum_{j=1}^{K} \hat{\beta}_j z_j(x), \tag{9.33}$$

where the coefficients $\hat{\beta}_1, \ldots, \hat{\beta}_K$ are simply obtained by OLS regression of y on
$z_1(x), \ldots, z_K(x)$. The functions $z_1(x), \ldots, z_K(x)$ form a truncated series. Examples
include a (K - 1)th-order polynomial approximation or power series with $z_j(x) =
x^{j-1}$, $j = 1, \ldots, K$; orthogonal and orthonormal polynomial variants (see Section
12.3.1); truncated Fourier series where the regressor is rescaled so that $x \in [0, 2\pi]$;
the Fourier flexible functional form of Gallant (1981), which is a truncated Fourier
series plus the terms $x$ and $x^2$; and regression splines that approximate the
regression function m(x) by polynomial functions between a given number of knots that are
joined at the knots.

The approach differs from that in Section 9.4 as it is a global approximation
approach to estimation of m(x), rather than a local approach to estimation of $m(x_0)$.
Nonetheless, $\hat{m}_K(x_0) \xrightarrow{p} m(x_0)$ if $K \to \infty$ at an appropriate rate as $N \to \infty$. From
Newey (1997) if x is k dimensional and m(x) is p times differentiable the mean
integrated squared error (see Section 9.5.3) is $O(K^{-2p/k} + K/N)$, where the
first term reflects bias and the second term variance. Equating these gives the optimal
$K^* = N^{k/(2p+k)}$, so K grows but at slower rate than the sample size. The convergence
rate of $\hat{m}_{K^*}(x)$ equals the fastest possible rate of Stone (1980), given in Section 9.4.5.
Intuitively, series estimators may not be robust as outliers may have a global rather
than merely local impact on $\hat{m}(x)$, but this conjecture is not tested in typical examples
given in texts.

Andrews (1991) and Newey (1997) give a very general treatment that includes 
the multivariate case, estimation of functionals other than the conditional mean, 
and extensions to semiparametric models where series methods are most often 
used. 
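A power series estimator amounts to OLS on the constructed regressors $z_j(x) = x^{j-1}$. The following is a minimal sketch that solves the normal equations with a small hand-written linear solver; the helper names and the noise-free quadratic data are assumptions made for the illustration.

```python
def solve_linear_system(a, b):
    """Solve a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]      # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):                      # back substitution
        tail = sum(m[r][c] * beta[c] for c in range(r + 1, n))
        beta[r] = (m[r][n] - tail) / m[r][r]
    return beta

def power_series_fit(x, y, K):
    """OLS of y on z_j(x) = x^(j-1), j = 1..K, via the normal equations."""
    Z = [[xi ** j for j in range(K)] for xi in x]
    ztz = [[sum(row[a] * row[b] for row in Z) for b in range(K)] for a in range(K)]
    zty = [sum(row[a] * yi for row, yi in zip(Z, y)) for a in range(K)]
    return solve_linear_system(ztz, zty)

def series_predict(beta, x0):
    """Evaluate the fitted series at x0."""
    return sum(b * x0 ** j for j, b in enumerate(beta))

x = [0.05 * i for i in range(41)]                # x in [0, 2]
y = [1.0 + 2.0 * xi - xi ** 2 for xi in x]       # exactly quadratic, no noise
beta = power_series_fit(x, y, K=3)
```

For larger K the raw powers become highly collinear and the normal equations ill-conditioned, which is one reason the orthogonal polynomial variants mentioned above are used in practice.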


9.7. Semiparametric Regression 


The preceding analysis has emphasized regression models without any structure. In 
microeconometrics some structure is usually placed on the regression model. 

First, economic theory may place some structure, such as symmetry and homo- 
geneity restrictions, in a demand function. Such information may be incorporated into 
nonparametric regression; see, for example, Matzkin (1994). 

Second, and more frequently, econometric models include so many potential regres- 
sors that the curse of dimensionality makes fully nonparametric analysis impractical. 
Instead, it is common to estimate a semiparametric model that loosely speaking com- 
bines a parametric component with a nonparametric component; see Powell (1994) for 
a careful discussion of the term semiparametric. 

There are many different semiparametric models and myriad methods are often 
available to consistently estimate these models. In this section we present just a few 
leading examples. Applications are given elsewhere in this book, including the binary 
outcome models and censored regression models given in Chapters 14 and 16. 


9.7.1. Examples 


Table 9.2 presents several leading examples of semiparametric regression. The first
two examples, detailed in the following, generalize the linear model $\mathbf{x}'\boldsymbol{\beta}$ by adding
an unspecified component $\lambda(\mathbf{z})$ or by permitting an unspecified transformation $g(\mathbf{x}'\boldsymbol{\beta})$,
whereas the third combines the first two. The next three models, used more in
applied statistics than econometrics, reduce the dimensionality by assuming additivity
or separability of the regressors but are otherwise nonparametric. We detail the
generalized additive model. Related to these are neural network models; see Kuan and
White (1994). The last example, also detailed in the following, is a flexible model of
the conditional variance. Care needs to be taken to ensure that semiparametric models
are identified. For example, see the discussion of single-index models. In addition to
estimation of $\boldsymbol{\beta}$, interest also lies in the marginal effects such as $\partial E[y|\mathbf{x}, \mathbf{z}]/\partial \mathbf{x}$.

Table 9.2. Semiparametric Models: Leading Examples

Name                         Model                                                                                          Parametric               Nonparametric
Partially linear             $E[y|\mathbf{x},\mathbf{z}] = \mathbf{x}'\boldsymbol{\beta} + \lambda(\mathbf{z})$              $\boldsymbol{\beta}$     $\lambda(\cdot)$
Single index                 $E[y|\mathbf{x}] = g(\mathbf{x}'\boldsymbol{\beta})$                                            $\boldsymbol{\beta}$     $g(\cdot)$
Generalized partial linear   $E[y|\mathbf{x},\mathbf{z}] = g(\mathbf{x}'\boldsymbol{\beta} + \lambda(\mathbf{z}))$           $\boldsymbol{\beta}$     $g(\cdot), \lambda(\cdot)$
Generalized additive         $E[y|\mathbf{x}] = c + \sum_{j=1}^{k} g_j(x_j)$                                                 -                        $g_j(\cdot)$
Partial additive             $E[y|\mathbf{x},\mathbf{z}] = \mathbf{x}'\boldsymbol{\beta} + c + \sum_{j=1}^{k} g_j(z_j)$      $\boldsymbol{\beta}$     $g_j(\cdot)$
Projection pursuit           $E[y|\mathbf{x}] = \sum_{j=1}^{M} g_j(\mathbf{x}'\boldsymbol{\beta}_j)$                         $\boldsymbol{\beta}_j$   $g_j(\cdot)$
Heteroskedastic linear       $E[y|\mathbf{x}] = \mathbf{x}'\boldsymbol{\beta}$; $V[y|\mathbf{x}] = \sigma^2(\mathbf{x})$     $\boldsymbol{\beta}$     $\sigma^2(\cdot)$


9.7.2. Efficiency of Semiparametric Estimators 


We consider loss of efficiency in estimating by semiparametric rather than parametric 
methods, ahead of presenting results for several leading semiparametric models. 

Our summary follows Robinson (1988b), who considers a semiparametric model
with parametric component denoted $\boldsymbol{\beta}$ and nonparametric component denoted G that
depends on infinitely many nuisance parameters. Examples of G include the shape of
the distribution of a symmetrically distributed iid error and the single-index function
$g(\cdot)$ given in (9.37) in Section 9.7.4. The estimator $\hat{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}}(\hat{G})$, where $\hat{G}$ is a
nonparametric estimator of G.

Ideally, the estimator $\hat{\boldsymbol{\beta}}$ is adaptive, meaning that there is no efficiency loss in
having to estimate G by nonparametric methods, so that

$$\sqrt{N}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) \xrightarrow{d} \mathcal{N}[\mathbf{0}, \mathbf{V}_G],$$

where $\mathbf{V}_G$ is the covariance matrix for any shape function G in the particular class
being considered. Within the likelihood framework $\mathbf{V}_G$ is the Cramer-Rao lower bound.
In the second-moment context $\mathbf{V}_G$ is given by the Gauss-Markov theorem or a
generalization such as to GMM. A leading example of an adaptive estimator is estimation
with specified conditional mean function but with unknown functional form for
heteroskedasticity (see Section 9.7.6).

If the estimator $\hat{\boldsymbol{\beta}}$ is not adaptive then the next best optimality property is for the
estimator to attain the semiparametric efficiency bound $\mathbf{V}_G^*$, so that

$$\sqrt{N}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) \xrightarrow{d} \mathcal{N}[\mathbf{0}, \mathbf{V}_G^*],$$

where $\mathbf{V}_G^*$ is a generalization of the Cramer-Rao lower bound or its second-moment
analogue that provides the smallest variance matrix possible given the specified
semiparametric model. For an adaptive estimator $\mathbf{V}_G^* = \mathbf{V}_G$, but usually $\mathbf{V}_G^*$ exceeds
$\mathbf{V}_G$. Semiparametric efficiency bounds are introduced in Section 9.7.8. They can be
obtained only in some semiparametric settings, and even when they are known no
estimator may exist that attains the bound. An example that attains the bound is the
binary choice model estimator of Klein and Spady (1993) (see Section 14.7.4).

If the semiparametric efficiency bound is not attained or is not known, then the next
best property is that $\sqrt{N}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) \xrightarrow{d} \mathcal{N}[\mathbf{0}, \mathbf{V}_G^{**}]$ for $\mathbf{V}_G^{**}$ greater than $\mathbf{V}_G^{*}$, which permits
the usual statistical inference. More generally, $\sqrt{N}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) = O_p(1)$ but is not
necessarily normally distributed. Finally, consistent but less than $\sqrt{N}$-consistent estimators
have the property that $N^{r}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) = O_p(1)$, where r < 0.5. Often asymptotic
normality cannot be established. This often arises when the parametric and nonparametric
parts are treated equally, so that maximization occurs jointly over $\boldsymbol{\beta}$ and G. There are
many examples, particularly in discrete and truncated choice models.

Despite their potential inefficiency, semiparametric estimators are attractive because
they can retain consistency in settings where a fully parametric estimator is
inconsistent. Powell (1994, p. 2513) presents a table that summarizes the existence of
consistent and $\sqrt{N}$-consistent asymptotic normal estimators for a range of semiparametric
models.


9.7.3. Partially Linear Model 


The partially linear model specifies the conditional mean to be the usual linear
regression function plus an unspecified nonlinear component, so

$$E[y|\mathbf{x}, \mathbf{z}] = \mathbf{x}'\boldsymbol{\beta} + \lambda(\mathbf{z}), \tag{9.34}$$

where the scalar function $\lambda(\cdot)$ is unspecified.

An example is the estimation of a demand function for electricity, where $\mathbf{z}$ reflects
time-of-day or weather indicators such as temperature. A second example is the sample
selection model given in Section 16.5. Ignoring $\lambda(\mathbf{z})$ leads to inconsistent $\hat{\boldsymbol{\beta}}$ owing to
omitted variables bias, unless $Cov[\mathbf{x}, \lambda(\mathbf{z})] = \mathbf{0}$. In applications interest may lie in $\boldsymbol{\beta}$,
$\lambda(\mathbf{z})$ or both. Fully nonparametric estimation of $E[y|\mathbf{x}, \mathbf{z}]$ is possible but leads to less
than $\sqrt{N}$-consistent estimation of $\boldsymbol{\beta}$.


Robinson Difference Estimator 
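Instead, Robinson (1988a) proposed the following method. The regression model
implies

$$y = \mathbf{x}'\boldsymbol{\beta} + \lambda(\mathbf{z}) + u,$$

where the error $u = y - E[y|\mathbf{x}, \mathbf{z}]$. This in turn implies

$$E[y|\mathbf{z}] = E[\mathbf{x}|\mathbf{z}]'\boldsymbol{\beta} + \lambda(\mathbf{z})$$

since $E[u|\mathbf{x}, \mathbf{z}] = 0$ implies $E[u|\mathbf{z}] = 0$. Subtracting the two equations yields

$$y - E[y|\mathbf{z}] = (\mathbf{x} - E[\mathbf{x}|\mathbf{z}])'\boldsymbol{\beta} + u. \tag{9.35}$$

The conditional moments in (9.35) are unknown, but they can be replaced by
nonparametric estimates.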
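Thus Robinson proposed the OLS regression estimation of

$$y_i - \hat{m}_{yi} = (\mathbf{x}_i - \hat{\mathbf{m}}_{xi})'\boldsymbol{\beta} + v, \tag{9.36}$$

where $\hat{m}_{yi}$ and $\hat{\mathbf{m}}_{xi}$ are predictions from nonparametric regression of, respectively, $y_i$
and $\mathbf{x}_i$ on $\mathbf{z}_i$. Given independence over i, the OLS estimator of $\boldsymbol{\beta}$ in (9.36) is $\sqrt{N}$
consistent and asymptotically normal with

$$\sqrt{N}(\hat{\boldsymbol{\beta}}_{PL} - \boldsymbol{\beta}) \xrightarrow{d} \mathcal{N}\!\left[\mathbf{0},\; \sigma^2 \left( \mathrm{plim}\, \frac{1}{N} \sum_{i=1}^{N} (\mathbf{x}_i - E[\mathbf{x}_i|\mathbf{z}_i])(\mathbf{x}_i - E[\mathbf{x}_i|\mathbf{z}_i])' \right)^{-1} \right],$$

assuming $u_i$ is iid $[0, \sigma^2]$. Not specifying $\lambda(\mathbf{z})$ generally leads to an efficiency loss,
though there is no loss if $E[\mathbf{x}|\mathbf{z}]$ is linear in $\mathbf{z}$. To estimate $V[\hat{\boldsymbol{\beta}}_{PL}]$ simply replace
$(\mathbf{x}_i - E[\mathbf{x}_i|\mathbf{z}_i])$ by $(\mathbf{x}_i - \hat{\mathbf{m}}_{xi})$. The asymptotic result generalizes to heteroskedastic
errors, in which case one just uses the usual Eicker-White standard errors from the OLS
regression (9.36). Since $\lambda(\mathbf{z}) = E[y|\mathbf{z}] - E[\mathbf{x}|\mathbf{z}]'\boldsymbol{\beta}$ it can be consistently estimated by
$\hat{\lambda}(\mathbf{z}_i) = \hat{m}_{yi} - \hat{\mathbf{m}}_{xi}'\hat{\boldsymbol{\beta}}$.

A variety of nonparametric estimators $\hat{m}_{yi}$ and $\hat{\mathbf{m}}_{xi}$ can be used. Robinson (1988a)
used kernel estimates that require convergence at rate no slower than $N^{-1/4}$ so that
oversmoothing or higher order kernels are needed if the dimension of $\mathbf{z}$ is large; see
Pagan and Ullah (1999, p. 205). Note also that the kernel estimators may be trimmed
(see Section 9.5.3).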
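The two steps of the difference estimator can be sketched for scalar x and scalar z. This is an illustration only: it assumes a Gaussian kernel with a fixed bandwidth, applies no trimming, and uses simulated data in which the true $\beta$ is 2 and $\lambda(z) = \sin(2z)$. All names and the data-generating process are hypothetical.

```python
import math
import random

def gaussian_kernel(z):
    """Standard normal density used as the kernel (an assumption)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def kernel_fit(z, v, z0, h):
    """Kernel regression of v on z, evaluated at z0."""
    w = [gaussian_kernel((zi - z0) / h) for zi in z]
    return sum(wi * vi for wi, vi in zip(w, v)) / sum(w)

def robinson_estimator(y, x, z, h):
    """Scalar-x version of (9.36): OLS without intercept of (y - m_y) on (x - m_x)."""
    m_y = [kernel_fit(z, y, zi, h) for zi in z]
    m_x = [kernel_fit(z, x, zi, h) for zi in z]
    r_y = [yi - mi for yi, mi in zip(y, m_y)]
    r_x = [xi - mi for xi, mi in zip(x, m_x)]
    beta_hat = sum(a * b for a, b in zip(r_x, r_y)) / sum(a * a for a in r_x)
    lambda_hat = [my - mx * beta_hat for my, mx in zip(m_y, m_x)]
    return beta_hat, lambda_hat

# Simulated data: x is correlated with z, so omitting lambda(z) would bias OLS.
rng = random.Random(3)
N = 300
z = [3.0 * i / (N - 1) for i in range(N)]
x = [zi ** 2 + rng.gauss(0.0, 1.0) for zi in z]
y = [2.0 * xi + math.sin(2.0 * zi) + rng.gauss(0.0, 0.2) for xi, zi in zip(x, z)]

beta_hat, lambda_hat = robinson_estimator(y, x, z, h=0.15)
```

The residual regression has no intercept because any constant is absorbed into $\lambda(z)$. Standard errors would come from the usual OLS or Eicker-White formulas applied to the residual regression.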


Other Estimators 


Several other methods lead to $\sqrt{N}$-consistent estimates of $\boldsymbol{\beta}$ in the partially linear
model. Speckman (1988) also used kernels. Engle et al. (1986) used a generalization
of the cubic smoothing spline estimator. Andrews (1991) presented regression of y on
$\mathbf{x}$ and a series approximation for $\lambda(\mathbf{z})$ given in Section 9.6.4. Yatchew (1997) presents
a simple differencing estimator.


9.7.4. Single-Index Models 


A single-index model specifies the conditional mean to be an unknown scalar function
of a linear combination of the regressors, with

$$E[y|\mathbf{x}] = g(\mathbf{x}'\boldsymbol{\beta}), \tag{9.37}$$

where the scalar function $g(\cdot)$ is unspecified. The advantages of single-index models
have been presented in Section 5.2.4. Here the function $g(\cdot)$ is obtained from the data,
whereas previous examples specified, for example, $E[y|\mathbf{x}] = \exp(\mathbf{x}'\boldsymbol{\beta})$.


Identification 


Ichimura (1993) presents identification conditions for the single-index model. For
unknown function $g(\cdot)$, the parameter $\boldsymbol{\beta}$ of the single-index model is only identified up to location and
scale. To see this note that for scalar v the function $g^*(a + bv)$ can always be expressed
as $g(v)$, so the function $g^*(a + b\mathbf{x}'\boldsymbol{\beta})$ is equivalent to $g(\mathbf{x}'\boldsymbol{\beta})$. Additionally, $g(\cdot)$ must
be differentiable. In the simplest case all regressors are continuous. If instead some
regressors are discrete, then at least one regressor must be continuous and if $g(\cdot)$ is
monotonic then bounds can be obtained for $\boldsymbol{\beta}$.


Average Derivative Estimator 


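For continuous regressors, Stoker (1986) observed that if the conditional mean is single
index then the vector of average derivatives of the conditional mean determines $\boldsymbol{\beta}$ up
to scale, since for $m(\mathbf{x}_i) = g(\mathbf{x}_i'\boldsymbol{\beta})$

$$\boldsymbol{\delta} = E\!\left[\frac{\partial m(\mathbf{x})}{\partial \mathbf{x}}\right] = E[g'(\mathbf{x}'\boldsymbol{\beta})]\,\boldsymbol{\beta}, \tag{9.38}$$

and $E[g'(\mathbf{x}_i'\boldsymbol{\beta})]$ is a scalar. Furthermore, by the generalized information matrix
equality given in Section 5.6.3, for any function $h(\mathbf{x})$, $E[\partial h(\mathbf{x})/\partial \mathbf{x}] = -E[h(\mathbf{x})\mathbf{s}(\mathbf{x})]$, where
$\mathbf{s}(\mathbf{x}) = \partial \ln f(\mathbf{x})/\partial \mathbf{x} = f'(\mathbf{x})/f(\mathbf{x})$ and $f(\mathbf{x})$ is the density of $\mathbf{x}$. Thus

$$\boldsymbol{\delta} = -E[m(\mathbf{x})\mathbf{s}(\mathbf{x})] = -E[E[y|\mathbf{x}]\,\mathbf{s}(\mathbf{x})]. \tag{9.39}$$

It follows that $\boldsymbol{\delta}$, and hence $\boldsymbol{\beta}$ up to scale, can be estimated by the average derivative
(AD) estimator

$$\hat{\boldsymbol{\delta}}_{AD} = -\frac{1}{N} \sum_{i=1}^{N} y_i\, \hat{\mathbf{s}}(\mathbf{x}_i), \tag{9.40}$$

where $\hat{\mathbf{s}}(\mathbf{x}_i) = \hat{f}'(\mathbf{x}_i)/\hat{f}(\mathbf{x}_i)$ can be obtained by kernel estimation of the density of $\mathbf{x}_i$
and its first derivative. The estimator $\hat{\boldsymbol{\delta}}$ is $\sqrt{N}$ consistent and its asymptotic normal
distribution was derived by Härdle and Stoker (1989). The function $g(\cdot)$ can be
estimated by nonparametric regression of $y_i$ on $\mathbf{x}_i'\hat{\boldsymbol{\delta}}$. Note that $\hat{\boldsymbol{\delta}}_{AD}$ provides an estimate
of $E[m'(\mathbf{x})]$ regardless of whether a single-index model is relevant.

A weakness of $\hat{\boldsymbol{\delta}}_{AD}$ is that $\hat{\mathbf{s}}(\mathbf{x}_i)$ can be very large if $\hat{f}(\mathbf{x}_i)$ is small. One possibility is
to trim when $\hat{f}(\mathbf{x}_i)$ is small. Powell, Stock, and Stoker (1989) instead observed that the
result (9.38) extends to weighted derivatives with $\boldsymbol{\delta} = E[w(\mathbf{x})m'(\mathbf{x})]$. Especially
convenient is to choose $w(\mathbf{x}) = f(\mathbf{x})$, which yields the density weighted average
derivative (DWAD) estimator

$$\hat{\boldsymbol{\delta}}_{DWAD} = -\frac{1}{N} \sum_{i=1}^{N} y_i\, \hat{f}'(\mathbf{x}_i), \tag{9.41}$$

which no longer divides by $\hat{f}(\mathbf{x}_i)$. This yields a $\sqrt{N}$-consistent and asymptotically
normal estimate of $\boldsymbol{\beta}$ up to scale. For example, if the first component of $\boldsymbol{\beta}$ is normalized
to one then $\hat{\beta}_1 = 1$ and $\hat{\beta}_j = \hat{\delta}_j/\hat{\delta}_1$ for j > 1.

These methods require continuous regressors so that the derivatives exist. Horowitz
and Härdle (1996) present extension to discrete regressors.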
Semiparametric Least Squares


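An alternative estimator of the single-index model was proposed by Ichimura (1993).
Begin by assuming that $g(\cdot)$ is known, in which case the WLS estimator of $\boldsymbol{\beta}$
minimizes

$$S_N(\boldsymbol{\beta}) = \frac{1}{N} \sum_{i=1}^{N} w_i(\mathbf{x}_i) \left(y_i - g(\mathbf{x}_i'\boldsymbol{\beta})\right)^2.$$

For unknown $g(\cdot)$ Ichimura proposed replacing $g(\mathbf{x}_i'\boldsymbol{\beta})$ by a nonparametric estimate
$\hat{g}(\mathbf{x}_i'\boldsymbol{\beta})$, leading to the weighted semiparametric least-squares (WSLS) estimator
$\hat{\boldsymbol{\beta}}_{WSLS}$ that minimizes

$$Q_N(\boldsymbol{\beta}) = \sum_{i=1}^{N} \pi(\mathbf{x}_i)\, w_i(\mathbf{x}_i) \left(y_i - \hat{g}(\mathbf{x}_i'\boldsymbol{\beta})\right)^2,$$

where $\pi(\mathbf{x}_i)$ is a trimming function that drops observations if the kernel regression
estimate of the scalar $\mathbf{x}_i'\boldsymbol{\beta}$ is small, and $\hat{g}(\mathbf{x}_i'\boldsymbol{\beta})$ is a leave-one-out kernel estimator
from regression of $y_i$ on $\mathbf{x}_i'\boldsymbol{\beta}$. This is a $\sqrt{N}$-consistent and asymptotically normal
estimate of $\boldsymbol{\beta}$ up to scale that is generally more efficient than the DWAD estimator. For
heteroskedastic data the most efficient estimator is the analogue of feasible GLS that
uses estimated weight function $\hat{w}_i(\mathbf{x}) = 1/\hat{\sigma}_i^2$, where $\hat{\sigma}_i^2$ is the kernel estimate given
in (9.43) of Section 9.7.6 and where $\hat{u}_i = y_i - \hat{g}(\mathbf{x}_i'\hat{\boldsymbol{\beta}})$ and $\hat{\boldsymbol{\beta}}$ is obtained from initial
minimization of $Q_N(\boldsymbol{\beta})$ with $w_i(\mathbf{x}) = 1$.

The WSLS estimator is computed by iterative methods. Begin with an initial
estimator $\hat{\boldsymbol{\beta}}^{(1)}$, such as the DWAD estimator with first component normalized to one. Form
the kernel estimate $\hat{g}(\mathbf{x}_i'\hat{\boldsymbol{\beta}}^{(1)})$ and hence $Q_N(\hat{\boldsymbol{\beta}}^{(1)})$, perturb $\hat{\boldsymbol{\beta}}^{(1)}$ to obtain the gradient
$\mathbf{g}_N(\hat{\boldsymbol{\beta}}^{(1)}) = \partial Q_N(\boldsymbol{\beta})/\partial \boldsymbol{\beta}\,|_{\hat{\boldsymbol{\beta}}^{(1)}}$ and hence an update $\hat{\boldsymbol{\beta}}^{(2)} = \hat{\boldsymbol{\beta}}^{(1)} + \mathbf{A}_N \mathbf{g}_N(\hat{\boldsymbol{\beta}}^{(1)})$, and so
on. This estimator is considerably more difficult to calculate than the DWAD
estimator, especially as $Q_N(\boldsymbol{\beta})$ can be nonconvex and multimodal.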


9.7.5. Generalized Additive Models 


Generalized additive models specify $E[y|\mathbf{x}] = g_1(x_1) + \cdots + g_k(x_k)$, a
specialization of the fully nonparametric model $E[y|\mathbf{x}] = g(x_1, \ldots, x_k)$. This specialization
results in the estimated subfunctions $\hat{g}_j(x_j)$ converging at the rate for a one-dimensional
nonparametric regression rather than the slower rate of a k-dimensional nonparametric
regression.

A well-developed methodology exists for estimating such models (see Hastie and
Tibshirani, 1990). This is automated in some statistical packages such as S-Plus. Plots
of the estimated subfunctions $\hat{g}_j(x_j)$ on $x_j$ trace out the marginal effects of $x_j$ on
$E[y|\mathbf{x}]$, so the additive model can provide a useful tool for exploratory data
analysis. The model sees little use in microeconometrics in part because many applications
such as censoring, truncation, and discrete outcomes lead naturally to single-index and
partially linear models.
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9.7.6. Heteroskedastic Linear Model 


The heteroskedastic linear model specifies

\[
\mathrm{E}[y|x] = x'\beta, \qquad \mathrm{V}[y|x] = \sigma^2(x),
\]

where the variance function $\sigma^2(\cdot)$ is unspecified.

The assumption that errors are heteroskedastic is the standard cross-section data
assumption in modern microeconometrics. One can obtain consistent but inefficient estimates
of $\beta$ by doing OLS and using the Eicker–White heteroskedastic-consistent estimate
of the variance matrix of the OLS estimator. Cragg (1983) and Amemiya (1983)
proposed an IV estimator that is more efficient than OLS but still not fully efficient.
Feasible GLS provides a fully efficient second-moment estimator but is not attractive
as it requires specification of a functional form for $\sigma^2(x)$ such as $\sigma^2(x) = \exp(x'\gamma)$.

Robinson (1987) proposed a variant of FGLS using a nonparametric estimator of
$\sigma_i^2 = \sigma^2(x_i)$. Then

\[
\hat{\beta}_{WLS} = \left( \sum_{i=1}^{N} \hat{\sigma}_i^{-2} x_i x_i' \right)^{-1}
\left( \sum_{i=1}^{N} \hat{\sigma}_i^{-2} x_i y_i \right), \tag{9.42}
\]

where Robinson (1987) used a $k$-NN estimator of $\sigma_i^2$ with uniform weight, so

\[
\hat{\sigma}_i^2 = \frac{1}{k} \sum_{j=1}^{N} \mathbf{1}\left( x_j \in N_k(x_i) \right) \hat{u}_j^2, \tag{9.43}
\]

where $\hat{u}_i = y_i - x_i'\hat{\beta}_{OLS}$ is the residual from first-stage OLS regression of $y_i$ on $x_i$ and
$N_k(x_i)$ is the set of $k$ observations of $x_j$ closest to $x_i$ in weighted Euclidean norm. Then

\[
\sqrt{N}\,(\hat{\beta}_{WLS} - \beta) \xrightarrow{d}
\mathcal{N}\left[ 0, \left( \operatorname{plim} N^{-1} \sum_{i=1}^{N} \sigma^{-2}(x_i)\, x_i x_i' \right)^{-1} \right],
\]

assuming $u_i$ is iid $[0, \sigma^2(x_i)]$. This estimator is adaptive as it attains the Gauss–Markov
bound, so it is as efficient as the GLS estimator when $\sigma_i^2$ is known. The
variance matrix is consistently estimated by $\left( N^{-1} \sum_i \hat{\sigma}_i^{-2} x_i x_i' \right)^{-1}$.

In principle other nonparametric estimators of $\sigma^2(x_i)$ might be used, but Carroll
(1982) and others originally proposed use of a kernel estimator of $\sigma_i^2$ and found that
proof of efficiency was possible only under very restrictive assumptions on $x_i$. The
Robinson method extends to models with nonlinear mean function.
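
The two-step logic of (9.42) and (9.43) can be illustrated with a short sketch. It is not Robinson's implementation: it assumes a single regressor with no intercept, plain absolute distance for the nearest-neighbor search, a hypothetical choice of $k$, and simulated data whose error standard deviation is proportional to the regressor.

```python
import random

def ols_slope(x, y):
    """OLS slope in the no-intercept model y = x*beta + u."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

def knn_variance(x, resid, k):
    """k-NN variance estimate as in (9.43): mean squared first-stage residual
    over the k observations whose x_j is closest to x_i (uniform weights)."""
    n = len(x)
    sig2 = []
    for i in range(n):
        nearest = sorted(range(n), key=lambda j: abs(x[j] - x[i]))[:k]
        sig2.append(sum(resid[j] ** 2 for j in nearest) / k)
    return sig2

def knn_fgls(x, y, k):
    """Weighted least squares as in (9.42), scalar-regressor case."""
    b_ols = ols_slope(x, y)
    resid = [yi - xi * b_ols for xi, yi in zip(x, y)]
    sig2 = knn_variance(x, resid, k)
    num = sum(xi * yi / s for xi, yi, s in zip(x, y, sig2))
    den = sum(xi * xi / s for xi, s in zip(x, sig2))
    return num / den, 1.0 / den  # estimate and its estimated variance

# Simulated data (hypothetical design): sd of the error is 0.5 * x.
random.seed(0)
N, beta = 400, 2.0
x = [random.uniform(1.0, 5.0) for _ in range(N)]
y = [beta * xi + random.gauss(0.0, 0.5 * xi) for xi in x]

b_wls, v_wls = knn_fgls(x, y, k=40)
print(round(ols_slope(x, y), 3), round(b_wls, 3), round(v_wls ** 0.5, 4))
```

The second returned value is the scalar analogue of the variance estimate given above, scaled by $1/N$ so that it applies to the estimate itself rather than to $\sqrt{N}$ times the estimation error.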


9.7.7. Seminonparametric MLE 


Suppose $y_i$ is iid with specified density $f(y_i|x_i, \beta)$. In general, misspecification of the
density leads to inconsistent parameter estimates. Gallant and Nychka (1987) proposed
approximating the unknown true density by a power-series expansion around the density
$f(y|x, \beta)$. To ensure a positive density they actually use a squared power-series
expansion around $f(y|x, \beta)$, yielding

\[
h_p(y|x, \beta, \alpha) = \frac{\left( p(y|\alpha) \right)^2 f(y|x, \beta)}
{\int \left( p(z|\alpha) \right)^2 f(z|x, \beta)\, dz}, \tag{9.44}
\]

where $p(y|\alpha)$ is a $p$th order polynomial in $y$, $\alpha$ is the vector of coefficients of the
polynomial, and division by the denominator ensures that probabilities integrate or sum to
one. The estimator of $\beta$ and $\alpha$ maximizes the log-likelihood $\sum_{i=1}^{N} \ln h_p(y_i|x_i, \beta, \alpha)$.
The approach generalizes immediately to multivariate $y_i$. The estimator is called the
seminonparametric maximum likelihood estimator because it is a nonparametric
estimator that can be estimated in the same way as a maximum likelihood estimator.
Gallant and Nychka (1987) showed that under fairly general conditions the estimator
yields consistent estimates of the density if the order $p$ of the polynomial increases
with sample size $N$ at an appropriate rate.

This result provides a strong basis for using (9.44) to obtain a class of flexible
distributions for any particular data. The method is particularly simple if the polynomial
series $p(y|\alpha)$ is the orthogonal or orthonormal polynomial series (see Section 12.3.1)
for the baseline density $f(y|x, \beta)$, as then the normalizing factor in the denominator
can be simply constructed. The order of the polynomial can be chosen using information
criteria, with measures that penalize model complexity more than AIC used in
practice. Regular ML statistical inference is possible if one ignores the data-dependent
selection of the polynomial order and assumes that the resulting density $h_p(y|x, \beta, \alpha)$
is correctly specified. An example of this approach for count data regression is given
in Cameron and Johansson (1997).
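
As a small illustration of (9.44), the sketch below builds a squared-polynomial density around a standard normal baseline. The second-order polynomial $p(y|\alpha) = 1 + \alpha_1 y + \alpha_2 y^2$, the absence of regressors, and the coefficient values are all assumptions made for the example. With a normal baseline the denominator has a closed form from the moments $\mathrm{E}[z^2] = 1$ and $\mathrm{E}[z^4] = 3$, and a numerical integral is used to check that the result integrates to one.

```python
import math

def normal_pdf(y):
    """Standard normal baseline density."""
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)

def snp_density(y, a1, a2):
    """Squared-polynomial density of the form (9.44), standard normal baseline,
    with hypothetical polynomial p(y) = 1 + a1*y + a2*y^2."""
    poly_sq = (1.0 + a1 * y + a2 * y * y) ** 2
    # Denominator E[p(z)^2] for z ~ N[0, 1]; odd moments vanish.
    norm_const = 1.0 + a1 ** 2 + 2.0 * a2 + 3.0 * a2 ** 2
    return poly_sq * normal_pdf(y) / norm_const

def integrate(f, lo=-12.0, hi=12.0, n=24000):
    """Midpoint-rule numerical integral of f over [lo, hi]."""
    step = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * step) for i in range(n)) * step

total = integrate(lambda y: snp_density(y, 0.5, 0.2))
print(total)
```

Setting both coefficients to zero returns the baseline density, so the baseline model is nested in the expanded one.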


9.7.8. Semiparametric Efficiency Bounds 


Semiparametric efficiency bounds extend efficiency bounds such as Cramér–Rao or
the Gauss–Markov theorem to cases where the dgp has a nonparametric component.
The best semiparametric methods achieve this efficiency bound.

We use $\beta$ to denote parameters we wish to estimate, which may include variance
components such as $\sigma^2$, and $\eta$ to denote nuisance parameters. For simplicity we
consider ML estimation with a nonparametric component.

We begin with the fully parametric case. The MLE $(\hat{\beta}, \hat{\eta})$ maximizes
$\mathcal{L}(\beta, \eta) = \ln L(\beta, \eta)$. Let $\theta = (\beta, \eta)$ and let $\mathcal{I}_{\theta\theta}$ be the information matrix defined in (5.43).
Then $\sqrt{N}(\hat{\theta} - \theta) \xrightarrow{d} \mathcal{N}[0, \mathcal{I}_{\theta\theta}^{-1}]$. For $\sqrt{N}(\hat{\beta} - \beta)$, partitioned inversion of $\mathcal{I}_{\theta\theta}$ leads
to

\[
V^{*} = \left( \mathcal{I}_{\beta\beta} - \mathcal{I}_{\beta\eta} \mathcal{I}_{\eta\eta}^{-1} \mathcal{I}_{\eta\beta} \right)^{-1} \tag{9.45}
\]

as the efficiency bound for estimation of $\beta$ when $\eta$ is unknown. There is an efficiency
loss when $\eta$ is unknown, unless the information matrix is block diagonal so that
$\mathcal{I}_{\beta\eta} = 0$ and the variance reduces to $\mathcal{I}_{\beta\beta}^{-1}$.
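
A quick numerical check of (9.45) for scalar $\beta$ and scalar $\eta$ is sketched below. The information-matrix entries are hypothetical numbers chosen only for illustration. The bound from (9.45) should equal the top-left element of the inverse of the full two-by-two information matrix, and should exceed the variance that applies when $\eta$ is known.

```python
def efficiency_bound(i_bb, i_bn, i_nn):
    """V* from (9.45) when beta and eta are both scalars."""
    return 1.0 / (i_bb - i_bn * (1.0 / i_nn) * i_bn)

def top_left_of_inverse(i_bb, i_bn, i_nn):
    """Top-left element of the inverse of [[i_bb, i_bn], [i_bn, i_nn]]."""
    det = i_bb * i_nn - i_bn ** 2
    return i_nn / det

# Hypothetical information matrix entries.
i_bb, i_bn, i_nn = 2.0, 0.8, 1.0
v_star = efficiency_bound(i_bb, i_bn, i_nn)
v_eta_known = 1.0 / i_bb  # variance when eta is known
print(v_star, top_left_of_inverse(i_bb, i_bn, i_nn), v_eta_known)
```

With a zero off-diagonal entry the two variances coincide, which is the block-diagonal case described above.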

Now consider extension to the nonparametric case. Suppose we have a parametric
submodel, say $\mathcal{L}_0(\beta)$, that involves $\beta$ alone. Consider the family of all possible
parametric models $\mathcal{L}(\beta, \eta)$ that nest $\mathcal{L}_0(\beta)$ for some value of $\eta$. The semiparametric
efficiency bound is the largest value of $V^{*}$ given in (9.45) over all possible parametric
models $\mathcal{L}(\beta, \eta)$, but this is difficult to obtain.

Simplification is possible by considering

\[
\tilde{s}_{\beta} = s_{\beta} - \mathrm{E}[s_{\beta}|s_{\eta}],
\]

where $s_{\theta}$ denotes the score $\partial\mathcal{L}/\partial\theta$, and $\tilde{s}_{\beta}$ is the score for $\beta$ after concentrating out
$\eta$. For finite-dimensional $\eta$ it can be shown that $\left( \mathrm{E}[N^{-1} \tilde{s}_{\beta} \tilde{s}_{\beta}'] \right)^{-1} = V^{*}$. Here $\eta$ is instead
infinite dimensional. Assume iid data and let $s_{\theta i}$ denote the $i$th component in the sum
that leads to the score $s_{\theta}$. Begun et al. (1983) define the tangent set to be the set of all
linear combinations of $s_{\eta i}$. When this tangent set is linear and closed the largest value
of $V^{*}$ in (9.45) equals

\[
\Omega = \left( \operatorname{plim} N^{-1} \tilde{s}_{\beta} \tilde{s}_{\beta}' \right)^{-1}
= \left( \mathrm{E}[\tilde{s}_{\beta i} \tilde{s}_{\beta i}'] \right)^{-1}.
\]

The matrix $\Omega$ is then the semiparametric efficiency bound.

In applications one first obtains $s_{\eta} = \sum_i s_{\eta i}$. Then obtain $\mathrm{E}[s_{\beta i}|s_{\eta i}]$, which may
entail assumptions such as symmetry of errors that place restrictions on the class of
semiparametric models being considered. This yields $\tilde{s}_{\beta i}$ and hence $\Omega$. For more
details and applications see Newey (1990b), Pagan and Ullah (1999), and Severini and
Tripathi (2001).


9.8. Derivations of Mean and Variance of Kernel Estimators 


Nonparametric estimation entails a balance between smoothness (variance) and bias 
(mean). Here we derive the mean and variance of kernel density and kernel regression 
estimators. The derivations follow those of M. J. Lee (1996). 


9.8.1. Mean and Variance of Kernel Density Estimator 


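Since $x_i$ are iid each term in the summation has the same expected value and

\[
\begin{aligned}
\mathrm{E}[\hat{f}(x_0)] &= \mathrm{E}\left[ \frac{1}{h} K\left( \frac{x - x_0}{h} \right) \right] \\
&= \int \frac{1}{h} K\left( \frac{x - x_0}{h} \right) f(x)\, dx .
\end{aligned}
\]

By change of variable to $z = (x - x_0)/h$ so that $x = x_0 + hz$ and $dx/dz = h$ we
obtain

\[
\mathrm{E}[\hat{f}(x_0)] = \int K(z) f(x_0 + hz)\, dz .
\]

A second-order Taylor series expansion of $f(x_0 + hz)$ around $f(x_0)$ yields

\[
\begin{aligned}
\mathrm{E}[\hat{f}(x_0)] &= \int K(z) \left\{ f(x_0) + f'(x_0) hz + \tfrac{1}{2} f''(x_0)(hz)^2 \right\} dz \\
&= f(x_0) \int K(z)\, dz + h f'(x_0) \int z K(z)\, dz + \tfrac{1}{2} h^2 f''(x_0) \int z^2 K(z)\, dz .
\end{aligned}
\]

Since the kernel $K(z)$ integrates to unity this simplifies to

\[
\mathrm{E}[\hat{f}(x_0)] - f(x_0) = h f'(x_0) \int z K(z)\, dz + \tfrac{1}{2} h^2 f''(x_0) \int z^2 K(z)\, dz .
\]

If additionally the kernel satisfies $\int z K(z)\, dz = 0$, assumed in condition (ii) in Section
9.3.3, and second derivatives of $f$ are bounded, then the first term on the right-hand
side disappears, yielding $\mathrm{E}[\hat{f}(x_0)] - f(x_0) = b(x_0)$, where $b(x_0)$ is defined in (9.4).

To obtain the variance of $\hat{f}(x_0)$, begin by noting that if $y_i$ are iid then
$\mathrm{V}[\bar{y}] = N^{-1}\mathrm{V}[y] = N^{-1}\mathrm{E}[y^2] - N^{-1}(\mathrm{E}[y])^2$. Thus

\[
\mathrm{V}[\hat{f}(x_0)] = \frac{1}{N} \mathrm{E}\left[ \left( \frac{1}{h} K\left( \frac{x - x_0}{h} \right) \right)^2 \right]
- \frac{1}{N} \left( \mathrm{E}\left[ \frac{1}{h} K\left( \frac{x - x_0}{h} \right) \right] \right)^2 .
\]

Now by change of variables and first-order Taylor series expansion

\[
\begin{aligned}
\mathrm{E}\left[ \left( \frac{1}{h} K\left( \frac{x - x_0}{h} \right) \right)^2 \right]
&= \int \frac{1}{h} K(z)^2 \left\{ f(x_0) + f'(x_0) hz \right\} dz \\
&= \frac{1}{h} f(x_0) \int K(z)^2\, dz + f'(x_0) \int z K(z)^2\, dz .
\end{aligned}
\]

It follows that

\[
\mathrm{V}[\hat{f}(x_0)] = \frac{1}{Nh} f(x_0) \int K(z)^2\, dz + \frac{1}{N} f'(x_0) \int z K(z)^2\, dz
- \frac{1}{N} \left[ f(x_0) + \frac{h^2}{2} f''(x_0) \int z^2 K(z)\, dz \right]^2 .
\]

For $h \to 0$ and $N \to \infty$ this is dominated by the first term, leading to Equation (9.5).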
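
The quality of the second-order bias approximation can be checked in a case where the exact mean is known. The sketch below assumes a Gaussian kernel (so that $\int z^2 K(z)\,dz = 1$) and standard normal data; in that special case $\int K(z) f(x_0 + hz)\,dz$ is the $\mathcal{N}[0, 1 + h^2]$ density evaluated at $x_0$. The bandwidth and evaluation point are hypothetical.

```python
import math

def normal_pdf(x, var=1.0):
    """Density of N[0, var] at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)

def exact_bias(x0, h):
    """E[f_hat(x0)] - f(x0) for a Gaussian kernel and N[0,1] data:
    the convolution of the two normals is the N[0, 1 + h^2] density."""
    return normal_pdf(x0, 1.0 + h * h) - normal_pdf(x0)

def approx_bias(x0, h):
    """Second-order approximation (h^2 / 2) f''(x0) * int z^2 K(z) dz,
    using f''(x) = (x^2 - 1) * phi(x) for the standard normal density."""
    return 0.5 * h * h * (x0 * x0 - 1.0) * normal_pdf(x0)

x0, h = 0.0, 0.2
print(exact_bias(x0, h), approx_bias(x0, h))
```

The two values agree closely for small $h$, and the gap shrinks quickly as $h$ falls, consistent with the remainder of the Taylor expansion being of higher order in $h$.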


9.8.2. Distribution of Kernel Regression Estimator 


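We obtain the distribution for regressors $x_i$ that are iid with density $f(x)$. From Section
9.5.1 the kernel estimator is a weighted average $\hat{m}(x_0) = \sum_i w_{i0,h}\, y_i$, where the kernel
weights $w_{i0,h}$ are given in (9.22). Since the weights sum to unity we have
$\hat{m}(x_0) - m(x_0) = \sum_i w_{i0,h}\,(y_i - m(x_0))$. Substituting (9.15) for $y_i$, and normalizing by $\sqrt{Nh}$
as in the kernel density estimator case we have

\[
\sqrt{Nh}\,(\hat{m}(x_0) - m(x_0)) = \sqrt{Nh} \sum_{i=1}^{N} w_{i0,h} \left( m(x_i) - m(x_0) + \varepsilon_i \right). \tag{9.46}
\]

One approach to obtaining the limit distribution of (9.46) is to take a second-order
Taylor series expansion of $m(x_i)$ around $x_0$. This approach is not always taken because
the weights $w_{i0,h}$ are complicated by the normalization that they sum to one (see
(9.22)).

Instead, we take the approach of Lee (1996, pp. 148-151) following Bierens (1987,
pp. 106-108). Note that the denominator of the weight function is the kernel estimate
of the density of $x_0$, since $\hat{f}(x_0) = (Nh)^{-1} \sum_i K((x_i - x_0)/h)$. Then (9.46) yields

\[
\sqrt{Nh}\,(\hat{m}(x_0) - m(x_0)) = \frac{1}{\sqrt{Nh}} \sum_{i=1}^{N}
K\left( \frac{x_i - x_0}{h} \right) \left( m(x_i) - m(x_0) + \varepsilon_i \right) \Big/ \hat{f}(x_0). \tag{9.47}
\]

We apply the Transformation Theorem (Theorem A.12) to (9.47), using $\hat{f}(x_0) \xrightarrow{p} f(x_0)$
for the denominator, while several steps are needed to obtain a limit normal
distribution for the numerator:

\[
\begin{aligned}
&\frac{1}{\sqrt{Nh}} \sum_{i=1}^{N} K\left( \frac{x_i - x_0}{h} \right) \left( m(x_i) - m(x_0) + \varepsilon_i \right) \\
&\quad = \frac{1}{\sqrt{Nh}} \sum_{i=1}^{N} K\left( \frac{x_i - x_0}{h} \right) \left( m(x_i) - m(x_0) \right)
+ \frac{1}{\sqrt{Nh}} \sum_{i=1}^{N} K\left( \frac{x_i - x_0}{h} \right) \varepsilon_i .
\end{aligned} \tag{9.48}
\]

Consider the first sum in (9.48); if a law of large numbers can be applied it converges
in probability to its mean

\[
\begin{aligned}
&\mathrm{E}\left[ \frac{1}{\sqrt{Nh}} \sum_{i=1}^{N} K\left( \frac{x_i - x_0}{h} \right) \left( m(x_i) - m(x_0) \right) \right] \\
&\quad = \frac{N}{\sqrt{Nh}} \int K\left( \frac{x - x_0}{h} \right) \left( m(x) - m(x_0) \right) f(x)\, dx \\
&\quad = \sqrt{Nh} \int K(z) \left( m(x_0 + hz) - m(x_0) \right) f(x_0 + hz)\, dz \\
&\quad = \sqrt{Nh} \int K(z) \left( hz\, m'(x_0) + \tfrac{1}{2} h^2 z^2 m''(x_0) \right) \left( f(x_0) + hz f'(x_0) \right) dz \\
&\quad = \sqrt{Nh} \left\{ \int K(z) h^2 z^2 m'(x_0) f'(x_0)\, dz + \int K(z) \tfrac{1}{2} h^2 z^2 m''(x_0) f(x_0)\, dz \right\} \\
&\quad = \sqrt{Nh}\, h^2 \left( m'(x_0) f'(x_0) + \tfrac{1}{2} m''(x_0) f(x_0) \right) \int z^2 K(z)\, dz \\
&\quad = \sqrt{Nh}\, f(x_0)\, b(x_0),
\end{aligned} \tag{9.49}
\]

where $b(x_0)$ is defined in (9.23). The first equality uses $x_i$ iid; the second equality is
change of variables to $z = (x - x_0)/h$; the third equality applies a second-order Taylor
series expansion to $m(x_0 + hz)$ and a first-order Taylor series expansion to $f(x_0 + hz)$;
the fourth equality follows because upon expanding the product to four terms, the two
terms given dominate the others (see, e.g., Lee, 1996, p. 150).

Now consider the second sum in (9.48); the terms in the sum clearly have mean
zero, and the variance of each term, dropping subscript $i$, is

\[
\begin{aligned}
\mathrm{V}\left[ K\left( \frac{x - x_0}{h} \right) \varepsilon \right]
&= \mathrm{E}\left[ K^2\left( \frac{x - x_0}{h} \right) \varepsilon^2 \right] \\
&= \int K^2\left( \frac{x - x_0}{h} \right) \mathrm{V}[\varepsilon|x]\, f(x)\, dx \\
&= h \int K^2(z)\, \mathrm{V}[\varepsilon|x_0 + hz]\, f(x_0 + hz)\, dz \\
&= h\, \mathrm{V}[\varepsilon|x_0]\, f(x_0) \int K^2(z)\, dz,
\end{aligned} \tag{9.50}
\]

by change of variables to $z = (x - x_0)/h$ with $dx = h\,dz$ in the third-line term, and
letting $h \to 0$ to get the last line. It follows upon applying a central limit theorem that

\[
\frac{1}{\sqrt{Nh}} \sum_{i=1}^{N} K\left( \frac{x_i - x_0}{h} \right) \varepsilon_i
\xrightarrow{d} \mathcal{N}\left[ 0,\ \mathrm{V}[\varepsilon|x_0]\, f(x_0) \int K^2(z)\, dz \right]. \tag{9.51}
\]

Combining (9.49) and (9.51), we have that $\sqrt{Nh}\,(\hat{m}(x_0) - m(x_0))$ defined in (9.47)
converges to $1/f(x_0)$ times $\mathcal{N}\left[ \sqrt{Nh}\, f(x_0) b(x_0),\ \mathrm{V}[\varepsilon|x_0]\, f(x_0) \int K^2(z)\, dz \right]$. Division
of the mean by $f(x_0)$ and the variance by $f(x_0)^2$ leads to the limit distribution given
in (9.24).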


9.9. Practical Considerations 


All-purpose regression packages increasingly offer adequate methods for univariate 
nonparametric density estimation and regression. The programming language XPlore 
emphasizes nonparametric and graphical methods; details on many of the methods are 
provided at its Web site. 

Nonparametric univariate density estimation is straightforward, using a kernel den- 
sity estimate based on a kernel such as the Gaussian or Epanechnikov. Easily computed 
plug-in estimates for the bandwidth provide a useful starting point that one may then, 
say, halve or double to see if there is an improvement. 

Nonparametric univariate regression is also straightforward, aside from bandwidth 
selection. If relatively unbiased estimates of the regression function at the end points 
are desired, then local linear regression or Lowess estimates are better than kernel 
regression. Plug-in estimates for the bandwidth are more difficult to obtain and cross- 
validation is instead used (see Section 9.5.3) along with eyeballing the scatterplot with 
a fitted line. The degree of desired smoothness can vary with application. For nonpara- 
metric multivariate regression such eyeballing may be impossible. 

Semiparametric regression is more complicated. It can entail subtleties such as trim- 
ming and undersmoothing the nonparametric component since typically estimation 
of the parametric component involves averaging the nonparametric component. For 
such purposes one generally uses specialized code written in languages such as Gauss, 
Matlab, Splus, or XPlore. For the nonparametric estimation component considerable 
computational savings can be obtained through use of fast computing algorithms such 
as binning and updating; see, for example, Fan and Gijbels (1996) and Härdle and 
Linton (1994). 

All methods require at some stage specification of a bandwidth or window width. 
Different choices lead to different estimates in finite samples, and the differences can 
be quite large as illustrated in many of the figures in this chapter. By contrast, within 
a fully parametric framework different researchers estimating the same model by ML 
will all obtain the same parameter estimates. This indeterminacy is a drawback of
nonparametric methods, though the hope is that in semiparametric methods at least the
spillover effects to the parametric component of the model may be small.


9.10. Bibliographic Notes 


Nonparametric estimation is well presented in many statistics texts, including Fan and Gijbels 
(1996). Ruppert, Wand, and Carroll (2003) present application of many semiparametric meth- 
ods. The econometrics books by Härdle (1990), M. J. Lee (1996), Horowitz (1998b), Pagan and 
Ullah (1999), and Yatchew (2003) cover both nonparametric and semiparametric estimation.
Pagan and Ullah (1999) is particularly comprehensive. Yatchew (2003) is oriented to the
applied econometrician. He emphasizes the partial linear and single-index models and practical
aspects of their implementation such as computation of confidence intervals.

9.3 Key early references for kernel density estimation are Rosenblatt (1956) and Parzen (1962).
Silverman’s (1986) is a classic book on nonparametric density estimation.

9.4 A quite general statement of optimal rates of convergence for nonparametric estimators is
given in Stone (1980).

9.5 Kernel regression estimation was proposed by Nadaraya (1964) and Watson (1964). A
very helpful and relatively simple survey of kernel and nearest-neighbors regression is by
Altman (1992). There are many other surveys in the statistics literature. Härdle (1990,
chapter 5) has a lengthy discussion of bandwidth choice and confidence intervals.

9.6 Many approaches to nonparametric local regression are contained in Stone (1977). For
series estimators see Andrews (1991) and Newey (1997).

9.6 For semiparametric efficiency bounds see the survey by Newey (1990b) and the more recent
paper by Severini and Tripathi (2001). An early econometrics application was given by
Chamberlain (1987).

9.7 The econometrics literature focuses on semiparametric regression. Survey papers include
those by Powell (1994), Robinson (1988b), and, at a more introductory level, Yatchew
(1998). Additional references are given elsewhere in this book, notably in Sections 14.7,
15.11, 16.9, 20.5, and 23.8. The applied study by Bellemare, Melenberg, and Van Soest
(2002) illustrates several semiparametric methods.


Exercises


9-1 Suppose we obtain a kernel density estimate using the uniform kernel (see
Table 9.1) with $h = 1$ and a sample of size $N = 100$. Suppose in fact the data
$x \sim \mathcal{N}[0, 1]$.

(a) Calculate the bias of the kernel density estimate at $x_0 = 1$ using (9.4).

(b) Is the bias large relative to the true value $\phi(1)$, where $\phi(\cdot)$ is the standard
normal pdf?

(c) Calculate the variance of the kernel density estimate at $x_0 = 1$ using (9.5).

(d) Which is making a bigger contribution to MSE at $x_0 = 1$, variance or bias
squared?

(e) Using results in Section 9.3.7, give a 95% confidence interval for the density
at $x_0 = 1$ based on the kernel density estimate $\hat{f}(1)$.

(f) For this example, what is the optimal bandwidth $h^{*}$ from (9.10)?


9-2 Suppose we obtain a kernel regression estimate using a uniform kernel (see
Table 9.1) with $h = 1$ and a sample of size $N = 100$. Suppose in fact the data
$x \sim \mathcal{N}[0, 1]$ and the conditional mean function is $m(x) = x^2$.

(a) Calculate the bias of the kernel regression estimate at $x_0 = 1$ using (9.23).

(b) Is the bias large relative to the true value $m(1) = 1$?

(c) Calculate the variance of the kernel regression estimate at $x_0 = 1$ using
(9.24).

(d) Which is making a bigger contribution to MSE at $x_0 = 1$, variance or bias
squared?

(e) Using results in Section 9.5.4, give a 95% confidence interval for $\mathrm{E}[y|x_0 = 1]$
based on the kernel regression estimate $\hat{m}(1)$.


9-3 This question assumes access to a nonparametric density estimation program. 
Use the Section 4.6.4 data on health expenditure. Use a kernel density estimate 
with Gaussian kernel (if available). 

(a) Obtain the kernel density estimate for health expenditure, choosing a suitable 
bandwidth by eyeballing and trial and error. State the bandwidth chosen. 

(b) Obtain the kernel density estimate for natural logarithm of health expenditure, 
choosing a suitable bandwidth by eyeballing and trial and error. State the 
bandwidth chosen. 

(c) Compare your answer in part (b) to an appropriate histogram. 

(d) If possible superimpose a fitted normal density on the same graph as the 
kernel density estimate from part (b). Do health expenditures appear to be 
log-normally distributed? 


9-4 This question assumes access to a kernel regression program or other non- 
parametric smoother. Use the complete sample of the Section 4.6.4 data 
on natural logarithm of health expenditure (y) and natural logarithm of total 
expenditure (x). 

(a) Obtain the kernel regression density estimate for health expenditure, choos- 
ing a good bandwidth by eyeballing and trial and error. State the bandwidth 
chosen. 

(b) Given part (a), does health appear to be a normal good? 

(c) Given part (a), does health appear to be a superior good? 

(d) Compare your nonparametric estimates with predictions from linear and 
quadratic regression.


CHAPTER 10


Numerical Optimization 


10.1. Introduction 


Theoretical results on consistency and the asymptotic distribution of an estimator de- 
fined as the solution to an optimization problem were presented in Chapters 5 and 6. 
The more practical issue of how to numerically obtain the optimum, that is, how to 
calculate the parameter estimates, when there is no explicit formula for the estimator, 
comprises the subject of this chapter. 

For the applied researcher estimation of standard nonlinear models, such as logit, 
probit, Tobit, proportional hazards, and Poisson, is seemingly no different from es- 
timation of an OLS model. A statistical package obtains the estimates and reports 
coefficients, standard errors, t-statistics, and p-values. Computational problems gen- 
erally only arise for the same reasons that OLS may fail, such as multicollinearity or 
incorrect data input. 

Estimation of less standard nonlinear models, including minor variants of a standard 
model, may require writing a program. This may be possible within a standard statisti- 
cal package. If not, then a programming language is used. Especially in the latter case 
a knowledge of optimization methods becomes necessary. 

General considerations for optimization are presented in Section 10.2. Various iter- 
ative methods, including the Newton—Raphson and Gauss—Newton gradient methods, 
are described in Section 10.3. Practical issues, including some common pitfalls, are 
presented in Section 10.4. These issues become especially relevant when the opti- 
mization method fails to produce parameter estimates. 


10.2. General Considerations 


Microeconometric analysis is often based on an estimator $\hat{\theta}$ that maximizes a stochastic
objective function $Q_N(\theta)$, where usually $\hat{\theta}$ solves the first-order conditions
$\partial Q_N(\theta)/\partial\theta = 0$. A minimization problem can be recast as a maximization by
multiplying the objective function by minus one. In nonlinear applications there will
generally be no explicit solution to the first-order conditions, a nonlinear system of
$q$ equations in the $q$ unknowns $\theta$.

A grid search procedure is usually impractical and iterative methods, usually gradient
methods, are employed.


10.2.1. Grid Search 


In grid search methods, the procedure is to select many values of $\theta$ along a grid,
compute $Q_N(\theta)$ for each of these values, and choose as the estimator $\hat{\theta}$ the value that
provides the largest (locally or globally depending on the application) value of $Q_N(\theta)$.

If a fine enough grid can be chosen this method will always work. It is generally
impractical, however, to choose a fine enough grid without further restrictions. For
example, if 10 parameters need to be estimated and the grid evaluates each parameter
at just 10 points, a very sparse grid, there are $10^{10}$ or 10 billion evaluations.

Grid search methods are nonetheless useful in applications where the grid search 
need only be performed among a subset of the parameters. They also permit viewing 
the response surface to verify that in using iterative methods one need not be concerned 
about multiple maxima. For example, many time-series packages do this for the scalar 
AR(1) coefficient in a regression model with AR(1) error. A second example is doing a 
grid search for the scalar inclusive parameter in a nested logit model (see Section 15.6). 
Of course, grid search methods may have to be used if nothing else works. 
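
A grid search is simple to code, and the sketch below also makes the cost explicit by counting evaluations. The two-parameter objective and the grid are hypothetical and serve only to illustrate the procedure.

```python
from itertools import product

def grid_search(objective, grids):
    """Evaluate the objective at every point of the Cartesian grid and
    return the maximizing point, its value, and the number of evaluations."""
    best_theta, best_value, n_evals = None, float("-inf"), 0
    for theta in product(*grids):
        value = objective(theta)
        n_evals += 1
        if value > best_value:
            best_theta, best_value = theta, value
    return best_theta, best_value, n_evals

def q(theta):
    """Hypothetical objective with maximum at (0.3, -0.7)."""
    return -((theta[0] - 0.3) ** 2) - (theta[1] + 0.7) ** 2

grid = [i / 10.0 for i in range(-10, 11)]  # 21 points per parameter
theta_hat, q_hat, n_evals = grid_search(q, [grid, grid])
print(theta_hat, n_evals)

# Cost of the sparse grid described in the text: 10 points for each of 10 parameters.
print(10 ** 10)
```

The number of evaluations is the product of the grid sizes, which is why the approach is usually reserved for one or two parameters at a time.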


10.2.2. Iterative Methods 


Virtually all microeconometric applications instead use iterative methods. These
update the current estimate of $\theta$ using a particular rule. Given an $s$th-round estimate $\hat{\theta}_s$,
the iterative method provides a rule that yields a new estimate $\hat{\theta}_{s+1}$, where $\hat{\theta}_s$ denotes
the $s$th-round estimate rather than the $s$th component of $\hat{\theta}$. Ideally, the new estimate is
a move toward the maximum, so that $Q_N(\hat{\theta}_{s+1}) > Q_N(\hat{\theta}_s)$, but in general this cannot
be guaranteed. Also, gradient methods may find a local maximum but not necessarily
the global maximum.


10.2.3. Gradient Methods 


Most iterative methods are gradient methods that change $\hat{\theta}_s$ in a direction determined
by the gradient. The update formula is a matrix weighted average of the gradient

\[
\hat{\theta}_{s+1} = \hat{\theta}_s + A_s g_s, \qquad s = 1, \ldots, S, \tag{10.1}
\]

where $A_s$ is a $q \times q$ matrix that depends on $\hat{\theta}_s$, and

\[
g_s = \left. \frac{\partial Q_N(\theta)}{\partial \theta} \right|_{\hat{\theta}_s} \tag{10.2}
\]

is the $q \times 1$ gradient vector evaluated at $\hat{\theta}_s$. Different gradient methods use different
matrices $A_s$, detailed in Section 10.3. A leading example is the Newton–Raphson
method, which sets $A_s = -H_s^{-1}$, where $H_s$ is the Hessian matrix defined later in (10.6).

Note that in this chapter $A$ and $g$ denote quantities that differ from those in other chapters.
Here $A$ is not the matrix that appears in the limit distribution of an estimator and
$g$ is not the conditional mean of $y$ in the nonlinear regression model.

Ideally, the matrix $A_s$ is positive definite for a maximum (or negative definite for
a minimum), as then it is likely that $Q_N(\hat{\theta}_{s+1}) > Q_N(\hat{\theta}_s)$. This follows from the first-order
Taylor series expansion $Q_N(\hat{\theta}_{s+1}) = Q_N(\hat{\theta}_s) + g_s'(\hat{\theta}_{s+1} - \hat{\theta}_s) + R$, where $R$ is
a remainder. Substituting in the update formula (10.1) yields

\[
Q_N(\hat{\theta}_{s+1}) - Q_N(\hat{\theta}_s) = g_s' A_s g_s + R,
\]

which is greater than zero if $A_s$ is positive definite and the remainder $R$ is sufficiently
small, since for a positive definite square matrix $A$ the quadratic form $x'Ax > 0$ for
all column vectors $x \neq 0$. Too small a value of $A_s$ leads to an iterative procedure that
is too slow; however, too large a value of $A_s$ may lead to overshooting, even if $A_s$ is
positive definite, as the remainder term cannot be ignored for large changes.

A common modification to gradient methods is to add a step-size adjustment to
prevent possible overshooting or undershooting, so

\[
\hat{\theta}_{s+1} = \hat{\theta}_s + \hat{\lambda}_s A_s g_s, \tag{10.3}
\]

where the stepsize $\hat{\lambda}_s$ is a scalar chosen to maximize $Q_N(\hat{\theta}_{s+1})$. At the $s$th round
first calculate $A_s g_s$, which may involve considerable computation. Then calculate
$Q_N(\hat{\theta})$, where $\hat{\theta} = \hat{\theta}_s + \lambda A_s g_s$, for a range of values of $\lambda$ (called a line search),
and choose $\hat{\lambda}_s$ as that $\lambda$ that maximizes $Q_N(\hat{\theta})$. Considerable computational savings
are possible because the gradient and $A_s$ are not recomputed along the line search.
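
The step-size adjustment in (10.3) can be sketched as follows. The example is illustrative only: it assumes the simplest choice of an identity weighting matrix (steepest ascent), a hypothetical concave quadratic objective, and a coarse grid of candidate step sizes in place of an exact line search.

```python
def q(theta):
    """Hypothetical concave objective with maximum at (1, -2)."""
    return -((theta[0] - 1.0) ** 2) - 10.0 * (theta[1] + 2.0) ** 2

def gradient(theta):
    """Analytical gradient of q."""
    return [-2.0 * (theta[0] - 1.0), -20.0 * (theta[1] + 2.0)]

def step_with_line_search(theta, direction, lambdas):
    """Choose lambda on a grid to maximize Q(theta + lambda * direction).
    The direction is computed once and reused for every candidate lambda."""
    def q_along(lam):
        return q([t + lam * d for t, d in zip(theta, direction)])
    lam_hat = max(lambdas, key=q_along)
    return [t + lam_hat * d for t, d in zip(theta, direction)], lam_hat

lambdas = [2.0 ** -j for j in range(12)]  # 1, 1/2, ..., 1/2048
theta = [0.0, 0.0]
path = [q(theta)]
for s in range(40):
    direction = gradient(theta)  # A_s g_s with A_s = I (an assumption)
    theta, lam_hat = step_with_line_search(theta, direction, lambdas)
    path.append(q(theta))
print(theta, path[-1])
```

Because the candidate grid includes small step sizes, the objective never decreases from one round to the next in this example, which is the purpose of the adjustment.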

A second modification is sometimes made when the matrix $A_s$ is defined as the
inverse of a matrix $B_s$, say, so that $A_s = B_s^{-1}$. Then if $B_s$ is close to singular a matrix
of constants, say $C$, is added or subtracted to permit inversion, so $A_s = (B_s + C)^{-1}$.
Similar adjustments can be made if $A_s$ is not positive definite. Further discussion of
computation of $A_s$ is given in Section 10.3.

Gradient methods are most likely to converge to the local maximum nearest the 
starting values. If the objective function has multiple local optima then a range of 
starting values should be used to increase the chance of finding the global maximum. 


10.2.4. Gradient Method Example 


Consider calculation of the NLS estimator in the exponential regression model when
the only regressor is the intercept. Then $\mathrm{E}[y] = e^{\beta}$ and a little algebra yields the gradient
$g = N^{-1} \sum_i (y_i - e^{\beta}) e^{\beta} = (\bar{y} - e^{\beta}) e^{\beta}$. Suppose in (10.1) we use $A_s = e^{-2\hat{\beta}_s}$,
which corresponds to the method of scoring variant of the Newton–Raphson algorithm
presented later in Section 10.3.2. The iterative method simplifies to
$\hat{\beta}_{s+1} = \hat{\beta}_s + (\bar{y} - e^{\hat{\beta}_s})/e^{\hat{\beta}_s}$.

As an example of the performance of this algorithm, suppose $\bar{y} = 2$ and the starting
value is $\hat{\beta}_1 = 0$. This leads to the iterations listed in Table 10.1. There is very rapid
convergence to the NLS estimate, which for this simple example can be analytically
obtained as $\hat{\beta} = \ln \bar{y} = \ln 2 = 0.693147$. The objective function increases throughout,
a consequence of use of the NR algorithm with globally concave objective function.
Note that overshooting occurs in the first iteration, from $\hat{\beta}_1 = 0.0$ to $\hat{\beta}_2 = 1.0$, greater
than $\hat{\beta} = 0.693$.

Table 10.1. Gradient Method Results

Round   Estimate           Gradient     Objective Function
$s$     $\hat{\beta}_s$    $g_s$        $Q_N(\hat{\beta}_s) = -\frac{1}{2N} \sum_i (y_i - e^{\hat{\beta}_s})^2$
1       0.000000            1.000000    $1.500000 - \sum_i y_i^2/2N$
2       1.000000           -1.952492    $1.742036 - \sum_i y_i^2/2N$
3       0.735758           -0.181711    $1.996210 - \sum_i y_i^2/2N$
4       0.694042           -0.003585    $1.999998 - \sum_i y_i^2/2N$
5       0.693147           -0.000002    $2.000000 - \sum_i y_i^2/2N$

Quick convergence usually occurs when the NR algorithm is used and the objective 
function is globally concave. The challenge in practice is that nonstandard nonlinear 
models often have objective functions that are not globally concave. 
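
The iterations of the intercept-only example are easy to reproduce. The sketch below applies the update rule from that example with mean of $y$ equal to 2 and starting value 0; the function name is hypothetical. The objective is reported up to the constant $-\sum_i y_i^2/2N$, as in Table 10.1, and printed values may differ from the table in the last digit because of rounding.

```python
import math

def nls_intercept_only(y_bar, beta_start, rounds):
    """Iterate beta_{s+1} = beta_s + (y_bar - exp(beta_s)) / exp(beta_s)
    and record (round, estimate, gradient, objective) at each round."""
    beta, table = beta_start, []
    for s in range(1, rounds + 1):
        e_b = math.exp(beta)
        grad = (y_bar - e_b) * e_b
        # Objective with the constant -sum_i y_i^2 / (2N) dropped.
        obj = y_bar * e_b - 0.5 * e_b * e_b
        table.append((s, beta, grad, obj))
        beta = beta + (y_bar - e_b) / e_b
    return table

table = nls_intercept_only(2.0, 0.0, 5)
for s, beta, grad, obj in table:
    print(f"{s}  {beta:.6f}  {grad:.6f}  {obj:.6f}")
```

The recorded objective values rise at every round while the estimate first overshoots and then settles near $\ln 2$.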


10.2.5. Method of Moments and GMM Estimators 


For m-estimators $Q_N(\theta) = N^{-1} \sum_i q_i(\theta)$ and the gradient
$g(\theta) = N^{-1} \sum_i \partial q_i(\theta)/\partial\theta$.

For GMM estimators $Q_N(\theta)$ is a quadratic form (see Section 6.3.2) and the gradient
takes the more complicated form

\[
g(\theta) = \left[ N^{-1} \sum_i \partial h_i(\theta)'/\partial\theta \right] \times W_N \times \left[ N^{-1} \sum_i h_i(\theta) \right].
\]

Some gradient methods can then no longer be used as they work only for averages.
Methods given in Section 10.3 that can still be used include Newton–Raphson, steepest
ascent, DFP, BFGS, and simulated annealing.

Method of moments and estimating equations estimators are defined as solving a
system of equations, but they can be converted to a numerical optimization problem
similar to GMM. The estimator $\hat{\theta}$ that solves the $q$ equations $N^{-1} \sum_i h_i(\theta) = 0$ can
be obtained by minimizing $Q_N(\theta) = [N^{-1} \sum_i h_i(\theta)]'[N^{-1} \sum_i h_i(\theta)]$.
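
The conversion of an estimating equation into a minimization problem can be sketched for a single parameter. The moment function $h_i(\theta) = y_i - e^{\theta}$, the data, and the use of a simple grid in place of a gradient method are all assumptions made for the example.

```python
import math

def mm_objective(theta, y):
    """Q_N(theta) = [N^{-1} sum_i h_i(theta)]^2 for the hypothetical
    scalar moment function h_i(theta) = y_i - exp(theta)."""
    h_bar = sum(yi - math.exp(theta) for yi in y) / len(y)
    return h_bar * h_bar

y = [1.0, 2.0, 3.0]                              # sample mean is 2
grid = [i / 1000.0 for i in range(0, 2001)]      # theta from 0 to 2
theta_hat = min(grid, key=lambda t: mm_objective(t, y))
print(theta_hat, math.log(2.0))
```

The minimizer on the grid is the grid point closest to the value that sets the sample moment to zero, here the log of the sample mean.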


10.2.6. Convergence Criteria 


Iterations continue until there is virtually no change. Programs ideally stop when all
of the following occur: (1) A small relative change occurs in the objective function
$Q_N(\hat{\theta}_s)$; (2) a small change of the gradient vector $g_s$ occurs relative to the Hessian;
and (3) a small relative change occurs in the parameter estimates $\hat{\theta}_s$. Statistical packages
typically choose default threshold values for these three changes, called convergence
criteria. These values can often be changed by the user. A conservative value
is $10^{-6}$. In addition there is usually a maximum number of iterations that will be
attempted. If this maximum is reached estimates are typically reported. The estimates
should not be used, however, unless convergence has been achieved.

If convergence is achieved then a local maximum has been obtained. However, there 
is no guarantee that the global maximum is obtained, unless the objective function is 
globally concave. 
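
A combined stopping rule along these lines might look like the following sketch. It is a simplification rather than any package's actual rule: in particular, the gradient check uses the largest absolute gradient element instead of scaling the gradient by the Hessian, and the function name and default tolerance are assumptions.

```python
def has_converged(q_old, q_new, theta_old, theta_new, grad_new, tol=1e-6):
    """Return True only if all three criteria are met:
    small relative change in the objective, small relative change in
    every parameter, and a small gradient (unscaled, as a simplification)."""
    rel_obj = abs(q_new - q_old) / max(abs(q_old), tol)
    rel_par = max(abs(a - b) / max(abs(b), tol)
                  for a, b in zip(theta_new, theta_old))
    grad_small = max(abs(g) for g in grad_new) < tol
    return rel_obj < tol and rel_par < tol and grad_small

# Late and early rounds resembling the intercept-only example.
print(has_converged(2.0, 2.0, [0.693147], [0.693147], [-2e-7]))   # True
print(has_converged(1.5, 1.742036, [0.0], [1.0], [-1.952492]))    # False
```

An iterative routine would call such a check after each update and stop either when it returns true or when the maximum number of iterations is reached.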


10.2.7. Starting Values 


The number of iterations is considerably reduced if the initial starting values $\hat{\theta}_1$ are
close to $\hat{\theta}$. Consistent parameter estimates are obviously good estimates to use as starting
values. A poor choice of starting values can lead to failure of iterative methods. In
particular, for some estimators and gradient methods it may not be possible to compute
$g_1$ or $A_1$ if the starting value is $\hat{\theta}_1 = 0$.

If the objective function is not globally concave it is good practice to use a range of 
starting values to increase the chance of obtaining a global maximum. 


10.2.8. Numerical and Analytical Derivatives 


Any gradient method by definition uses derivatives of the objective function. Either
numerical derivatives or analytical derivatives may be used.

Numerical derivatives are computed using

\[
\frac{\Delta Q_N(\hat{\theta}_s)}{\Delta \theta_j} = \frac{Q_N(\hat{\theta}_s + h e_j) - Q_N(\hat{\theta}_s - h e_j)}{2h},
\qquad j = 1, \ldots, q, \tag{10.4}
\]

where $h$ is small and $e_j = (0 \ldots 0\ 1\ 0 \ldots 0)'$ is a vector with unity in the $j$th row and
zeros elsewhere.

In theory $h$ should be very small, as formally $\partial Q_N(\theta)/\partial\theta_j$ equals the limit of
$\Delta Q_N(\theta)/\Delta\theta_j$ as $h \to 0$. In practice too small a value of $h$ leads to inaccuracy owing
to rounding error. For this reason calculations using numerical derivatives should
always be done in double precision or quadruple precision rather than single precision.
Although a program may use a default value such as $h = 10^{-6}$, other values will be
better for any particular problem. For example, a smaller value of $h$ is appropriate if the
dependent variable $y$ in NLS regression is measured in thousands of dollars rather than
dollars (with regressors not rescaled), since then $\theta$ will be one-thousandth the size.
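
The central-difference formula (10.4) and the trade-off in the choice of $h$ can be illustrated with a short sketch. The objective is the intercept-only NLS objective used earlier in this chapter with mean of $y$ equal to 2 and the constant term dropped; the evaluation point and the three step sizes are hypothetical.

```python
import math

def central_difference(q, theta, j, h=1e-6):
    """Numerical derivative of q with respect to element j of theta, as in (10.4)."""
    up = list(theta)
    up[j] += h
    down = list(theta)
    down[j] -= h
    return (q(up) - q(down)) / (2.0 * h)

def q(theta):
    """Intercept-only NLS objective with y_bar = 2, constant dropped."""
    return 2.0 * math.exp(theta[0]) - 0.5 * math.exp(2.0 * theta[0])

# Analytical derivative at theta = 1 is (y_bar - e) * e.
analytic = (2.0 - math.exp(1.0)) * math.exp(1.0)
errors = {h: abs(central_difference(q, [1.0], 0, h) - analytic)
          for h in (1e-2, 1e-6, 1e-13)}
for h, err in errors.items():
    print(h, err)
```

The largest step suffers from approximation error and the smallest from rounding error, so the intermediate step is the most accurate of the three in this example.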

A drawback of using numerical derivatives is that these derivatives have to be computed
many times: for each of the $q$ parameters, for each of the $N$ observations, and
for each of the $S$ iterations. This requires $2qNS$ evaluations of the objective function,
where each evaluation itself may be computationally burdensome.

An alternative is to use analytical derivatives. These will be more accurate than
numerical derivatives and may be much quicker to compute, especially if the analytical
derivatives are simpler than the objective function itself. Moreover, only $qNS$ function
evaluations are needed.

For methods that additionally require calculation of second derivatives to form $A_s$
there is even greater benefit to providing analytical derivatives. Even if just analytical
first derivatives are given, the second derivative may then be more quickly and
accurately obtained as the numerical first derivative of the analytical first derivative.
Statistical packages often provide the user with the option of providing analytical first
and second derivatives.

Numerical derivatives have the advantage of requiring no coding beyond providing 
the objective function. This saves coding time and eliminates one possible source of 
user error, though some packages have the ability to take analytical derivatives. 

If computational time is a factor or if there is concern about accuracy of calcula- 
tions, however, it is worthwhile going to the trouble of providing analytical derivatives. 
It is still good practice then to check that the analytical derivatives have been correctly 
coded by obtaining parameter estimates using numerical derivatives, with starting val- 
ues the estimates obtained using analytical derivatives. 
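
A more direct variant of this check, sketched below, compares the coded analytical gradient with a two-sided numerical derivative at a few trial parameter values. It is an assumption of this sketch, not a procedure from the text; the data, the scalar exponential-mean objective, and the tolerance are hypothetical.

```python
import math

x = [0.0, 0.5, 1.0, 1.5]   # hypothetical regressor values
y = [1.1, 1.2, 1.9, 2.0]   # hypothetical outcomes

def Q(b):
    # Sum of squared residuals for the mean function exp(x * b).
    return sum((yi - math.exp(xi * b)) ** 2 for xi, yi in zip(x, y))

def grad_analytical(b):
    # Hand-coded derivative of Q with respect to b.
    return -2.0 * sum((yi - math.exp(xi * b)) * math.exp(xi * b) * xi
                      for xi, yi in zip(x, y))

def grad_numerical(b, h=1e-6):
    return (Q(b + h) - Q(b - h)) / (2.0 * h)

def gradients_agree(b, tol=1e-5):
    """True if the two derivatives agree to a relative tolerance tol."""
    ga, gn = grad_analytical(b), grad_numerical(b)
    return abs(ga - gn) <= tol * max(1.0, abs(ga))

print([gradients_agree(b) for b in (-1.0, 0.0, 0.3, 1.0)])
```

A disagreement at any trial value points to a coding error in the analytical derivative, or to a poorly chosen step size.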


10.2.9. Nongradient Methods 


Gradient methods presume the objective function is sufficiently smooth to ensure ex- 
istence of the gradient. For some examples, notably least absolute deviations (LAD), 
quantile regression, and maximum score estimation, there is no gradient and alterna- 
tive iterative methods are used. 

For example, for LAD the objective function $Q_N(\beta) = N^{-1} \sum_i |y_i - x_i'\beta|$ has no
derivative and linear programming methods are used in place of gradient methods. 
Such examples are sufficiently rare in microeconometrics that we focus almost exclu- 
sively on gradient methods. 

For objective functions that are difficult to maximize, particularly because of multi- 
ple local optima, use can be made of nongradient methods such as simulated annealing 
(presented in Section 10.3.8) and genetic algorithms (see Dorsey and Mayer, 1995). 


10.3. Specific Methods 


The leading method for maximizing a globally concave objective function is the
Newton-Raphson iterative method. The other methods, such as steepest descent and DFP, are
usually learnt and employed when the Newton-Raphson method fails. Another common
method is the Gauss-Newton method for the NLS estimator. This method is
not as universal as the Newton-Raphson method, as it is applicable only to
least-squares problems, and it can be obtained as a minor adaptation of the Newton-Raphson
method. These various methods are designed to obtain a local optimum given some
starting values for the parameters.

This section also presents the expectation maximization method, which is particularly
useful in missing data problems, and the method of simulated annealing, which is an example of
a nongradient method and is more likely to yield a global rather than local maximum.


10.3.1. Newton-Raphson Method


The Newton-Raphson (NR) method is a popular gradient method that works especially
well if the objective function is globally concave in $\theta$. In this method

$$\hat{\theta}_{s+1} = \hat{\theta}_s - H_s^{-1} g_s, \tag{10.5}$$

where $g_s$ is defined in (10.2) and

$$H_s = \left. \frac{\partial^2 Q_N(\theta)}{\partial \theta \, \partial \theta'} \right|_{\hat{\theta}_s} \tag{10.6}$$

is the $q \times q$ Hessian matrix evaluated at $\hat{\theta}_s$. These formulas apply to both
maximization and minimization of $Q_N(\theta)$ since premultiplying $Q_N(\theta)$ by minus one changes
the sign of both $H_s^{-1}$ and $g_s$.

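To motivate the NR method, begin with the $s$th-round estimate $\hat{\theta}_s$ for $\theta$. Then by
second-order Taylor series expansion around $\hat{\theta}_s$

$$Q_N(\theta) = Q_N(\hat{\theta}_s) + \left. \frac{\partial Q_N(\theta)}{\partial \theta'} \right|_{\hat{\theta}_s} (\theta - \hat{\theta}_s) + \frac{1}{2} (\theta - \hat{\theta}_s)' \left. \frac{\partial^2 Q_N(\theta)}{\partial \theta \, \partial \theta'} \right|_{\hat{\theta}_s} (\theta - \hat{\theta}_s) + R.$$

Ignoring the remainder term $R$ and using more compact notation, we approximate
$Q_N(\theta)$ by

$$Q_N^*(\theta) = Q_N(\hat{\theta}_s) + g_s'(\theta - \hat{\theta}_s) + \frac{1}{2} (\theta - \hat{\theta}_s)' H_s (\theta - \hat{\theta}_s),$$

where $g_s$ and $H_s$ are defined in (10.2) and (10.6). To maximize the approximation
$Q_N^*(\theta)$ with respect to $\theta$ we set the derivative to zero. Then $g_s + H_s(\theta - \hat{\theta}_s) = 0$,
and solving for $\theta$ yields $\hat{\theta}_{s+1} = \hat{\theta}_s - H_s^{-1} g_s$, which is (10.5). The NR update therefore
maximizes a second-order Taylor series approximation to $Q_N(\theta)$ evaluated at $\hat{\theta}_s$.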

To see whether NR iterations will necessarily increase $Q_N(\theta)$, substitute the
$(s+1)$th-round estimate back into the Taylor series approximation to obtain

$$Q_N(\hat{\theta}_{s+1}) = Q_N(\hat{\theta}_s) - \frac{1}{2} (\hat{\theta}_{s+1} - \hat{\theta}_s)' H_s (\hat{\theta}_{s+1} - \hat{\theta}_s) + R.$$

Ignoring the remainder term, we see that this increases (or decreases) if $H_s$ is negative
(or positive) definite. At a local maximum the Hessian is negative semi-definite, but
away from the maximum this may not be the case even for well-defined problems. If
the NR method strays into such territory it may not necessarily move toward the
maximum. Furthermore the Hessian may then be singular, in which case $H_s^{-1}$ in (10.5) cannot
be computed. Clearly, the NR method works best for maximization (or minimization)
problems if the objective function is globally concave (or convex), as then $H_s$ is
always negative (or positive) definite. In such cases convergence often occurs within
10 iterations.
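
The update (10.5) can be sketched for a scalar parameter as follows. The objective used here, $Q(\theta) = \bar{y}\theta - e^{\theta}$ with $\bar{y} = 2$, is a hypothetical globally concave example (an intercept-only Poisson-type log-likelihood) whose maximizer $\ln \bar{y}$ is known, so the result can be checked.

```python
import math

def newton_raphson(grad, hess, theta, tol=1e-10, max_iter=50):
    """Scalar Newton-Raphson: theta_{s+1} = theta_s - g_s / H_s, as in (10.5)."""
    for s in range(1, max_iter + 1):
        g, H = grad(theta), hess(theta)
        if H == 0.0:
            raise ZeroDivisionError("Hessian is singular; the NR step is undefined")
        step = -g / H
        theta += step
        if abs(step) < tol:
            return theta, s
    raise RuntimeError("NR did not converge")

ybar = 2.0   # hypothetical sample mean
theta_hat, n_iter = newton_raphson(
    grad=lambda t: ybar - math.exp(t),   # first derivative of Q
    hess=lambda t: -math.exp(t),         # second derivative, negative everywhere
    theta=0.0,
)
print(theta_hat, math.log(ybar), n_iter)
```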

An additional attraction of the NR method arises if the starting value $\hat{\theta}_1$ is root-$N$
consistent, that is, if $\sqrt{N}(\hat{\theta}_1 - \theta_0)$ has a proper limiting distribution. Then the
second-round estimator $\hat{\theta}_2$ can be shown to have the same asymptotic distribution as the
estimator obtained by iterating to convergence. There is therefore no theoretical gain to
further iteration. An example is feasible GLS, where initial OLS leads to consistent
regression parameter estimates, and these in turn are used to obtain consistent variance
parameter estimates, which are then used to obtain efficient GLS. A second example
is use of easily obtained consistent estimates as starting values before maximizing a
complicated likelihood function. Although there is no need to iterate further, in practice
most researchers still prefer to iterate to convergence unless this is computationally too
time consuming. One advantage of iterating to convergence is that different researchers
should obtain the same parameter estimates, whereas different initial root-$N$ consistent
estimates lead to second-round parameter estimates that will differ even though they
are asymptotically equivalent.


10.3.2. Method of Scoring 


A common modification of the NR method is the method of scoring (MS). In this
method the Hessian matrix is replaced by its expected value

$$H_{MS,s} = \left. \mathrm{E}\left[ \frac{\partial^2 Q_N(\theta)}{\partial \theta \, \partial \theta'} \right] \right|_{\hat{\theta}_s}. \tag{10.7}$$

This substitution is especially advantageous when applied to the MLE (i.e., $Q_N(\theta) =
N^{-1} \mathcal{L}_N(\theta)$), because the expected value should be negative definite, since by the
information matrix equality (see Section 5.6.3),
$-H_{MS,s} = N^{-1} \mathrm{E}[\partial \mathcal{L}_N / \partial \theta \times \partial \mathcal{L}_N / \partial \theta']$, which
is positive definite since it is a covariance matrix. Obtaining the expectation in (10.7)
is possible only for m-estimators and even then may be analytically difficult.

The method of scoring algorithm for the MLE of generalized linear models, such 
as the Poisson, probit, and logit, can be shown to be implementable using iteratively 
reweighted least squares (see McCullagh and Nelder, 1989). This was advantageous to 
early adopters of these models who only had access to an OLS program. 

The method of scoring can also be applied to m-estimators other than the MLE,
though then $H_{MS,s}$ may not be negative definite.


10.3.3. BHHH Method 


The BHHH method of Berndt, Hall, Hall, and Hausman (1974) uses (10.1) with
weighting matrix $A_s = -H_{BHHH,s}^{-1}$, where the matrix

$$H_{BHHH,s} = -\sum_{i=1}^{N} \left. \frac{\partial q_i(\theta)}{\partial \theta} \frac{\partial q_i(\theta)}{\partial \theta'} \right|_{\hat{\theta}_s}, \tag{10.8}$$

and $Q_N(\theta) = \sum_i q_i(\theta)$. Compared to NR, this has the advantage of requiring
evaluation of first derivatives only, offering considerable computational savings.

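To justify this method, begin with the method of scoring for the MLE, in which case
$Q_N(\theta) = \sum_i \ln f_i(\theta)$, where $\ln f_i(\theta)$ is the log-density. The information matrix equality
can be expressed as

$$\mathrm{E}\left[ \frac{\partial^2 \mathcal{L}_N(\theta)}{\partial \theta \, \partial \theta'} \right] = -\mathrm{E}\left[ \sum_{i=1}^{N} \frac{\partial \ln f_i(\theta)}{\partial \theta} \sum_{j=1}^{N} \frac{\partial \ln f_j(\theta)}{\partial \theta'} \right],$$

and independence over $i$ implies

$$\mathrm{E}\left[ \frac{\partial^2 \mathcal{L}_N(\theta)}{\partial \theta \, \partial \theta'} \right] = -\sum_{i=1}^{N} \mathrm{E}\left[ \frac{\partial \ln f_i(\theta)}{\partial \theta} \frac{\partial \ln f_i(\theta)}{\partial \theta'} \right].$$

Dropping the expectation leads to (10.8).

The BHHH method can also be applied to estimators other than the MLE, in which
case it is viewed as simply another choice of matrix $A_s$ in (10.1) rather than as an
estimate of the Hessian matrix $H_s$.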

The BHHH method is used for many cross-section m-estimators as it can work well 
and requires only first derivatives. 
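
A minimal scalar sketch of BHHH iterations follows, assuming the hypothetical subfunction $q_i(\theta) = y_i \theta - e^{\theta}$ (an intercept-only Poisson log-density up to a constant) and made-up count data. No line search is used, so convergence here is slower than for NR.

```python
import math

y = [0, 1, 1, 2, 2, 2, 3, 3, 4, 2]   # hypothetical counts with sample mean 2

def scores(theta):
    """First derivatives of q_i(theta) = y_i * theta - exp(theta), one per observation."""
    mu = math.exp(theta)
    return [yi - mu for yi in y]

def bhhh(theta, tol=1e-10, max_iter=500):
    for s in range(1, max_iter + 1):
        sc = scores(theta)
        g = sum(sc)                      # gradient of Q_N = sum_i q_i
        outer = sum(v * v for v in sc)   # sum of squared scores, the scalar -H_BHHH
        step = g / outer                 # A_s g_s with A_s = -1 / H_BHHH
        theta += step
        if abs(step) < tol:
            return theta, s
    raise RuntimeError("BHHH did not converge")

theta_hat, n_iter = bhhh(0.0)
print(theta_hat, math.log(sum(y) / len(y)), n_iter)
```

Only first derivatives are evaluated at each round; the price in this example is a few dozen iterations rather than the handful NR would need.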


10.3.4. Method of Steepest Ascent 


The method of steepest ascent sets $A_s = I_q$, the simplest choice of weighting matrix.
A line search is then done (see (10.3)) to scale $I_q$ by a constant $\lambda_s$.

The line search can be done manually. In practice it is common to use the optimal
$\lambda$ for the line search, which can be shown to be $\lambda_s = -g_s' g_s / g_s' H_s g_s$, where $H_s$ is the
Hessian matrix. This optimal $\lambda_s$ requires computation of the Hessian, in which case
one might instead use NR. The advantage of steepest ascent rather than NR is that $H_s$
can be singular, though $g_s' H_s g_s$ still needs to be negative (as it is if $H_s$ is negative
definite) to ensure $\lambda_s > 0$, so that $\lambda_s I_q$ is positive definite and the step is uphill.
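
Steepest ascent with the line-search scalar $\lambda_s = -g_s' g_s / g_s' H_s g_s$ can be sketched for two parameters as follows. The quadratic objective, with maximum at $(1, -2)$, is a hypothetical example chosen so that the Hessian is constant and the answer is known.

```python
def grad(t):
    # Gradient of the hypothetical objective Q = -(t0 - 1)^2 - 5 (t1 + 2)^2.
    return [-2.0 * (t[0] - 1.0), -10.0 * (t[1] + 2.0)]

H = [[-2.0, 0.0], [0.0, -10.0]]   # constant Hessian of Q

def steepest_ascent(theta, tol=1e-8, max_iter=500):
    theta = list(theta)
    for s in range(max_iter):
        g = grad(theta)
        gg = g[0] * g[0] + g[1] * g[1]
        if gg ** 0.5 < tol:
            return theta, s
        Hg = [H[0][0] * g[0] + H[0][1] * g[1], H[1][0] * g[0] + H[1][1] * g[1]]
        gHg = g[0] * Hg[0] + g[1] * Hg[1]
        lam = -gg / gHg                      # lambda_s = -g'g / g'Hg
        theta = [theta[0] + lam * g[0], theta[1] + lam * g[1]]
    raise RuntimeError("steepest ascent did not converge")

theta_hat, n_iter = steepest_ascent([0.0, 0.0])
print(theta_hat, n_iter)
```

Because the two parameters have different curvature, the iterates zigzag toward the maximum, which is why many more rounds are needed than with NR on the same problem.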


10.3.5. DFP and BFGS Methods 


The DFP algorithm due to Davidon, Fletcher, and Powell is a gradient method with
weighting matrix $A_s$ that is positive definite and requires computation of only first
derivatives, unlike NR, which requires computation of the Hessian. Here the method
is presented without derivation.

The weighting matrix $A_s$ is computed by the recursion

$$A_s = A_{s-1} + \frac{\delta_{s-1} \delta_{s-1}'}{\delta_{s-1}' \gamma_{s-1}} + \frac{A_{s-1} \gamma_{s-1} \gamma_{s-1}' A_{s-1}}{\gamma_{s-1}' A_{s-1} \gamma_{s-1}}, \tag{10.9}$$

where $\delta_{s-1} = A_{s-1} g_{s-1}$ and $\gamma_{s-1} = g_s - g_{s-1}$. By inspection of the right-hand side
of (10.9), $A_s$ will be positive definite provided the initial $A_0$ is positive definite (e.g.,
$A_0 = I$).

The procedure converges quite well in many statistical applications. Eventually $A_s$
goes to the theoretically preferred $-H_s^{-1}$. In principle this method can also provide
an approximate estimate of the inverse of the Hessian for use in computation of stan- 
dard errors, without needing either second derivatives or matrix inversion. In practice, 
however, this estimate can be a poor one. 

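A refinement of the DFP algorithm is the BFGS algorithm of Broyden, Fletcher,
Goldfarb, and Shanno with

$$A_s = A_{s-1} + \frac{\delta_{s-1} \delta_{s-1}'}{\delta_{s-1}' \gamma_{s-1}} + \frac{A_{s-1} \gamma_{s-1} \gamma_{s-1}' A_{s-1}}{\gamma_{s-1}' A_{s-1} \gamma_{s-1}} - (\gamma_{s-1}' A_{s-1} \gamma_{s-1}) \, \eta_{s-1} \eta_{s-1}', \tag{10.10}$$

where $\eta_{s-1} = (\delta_{s-1} / \delta_{s-1}' \gamma_{s-1}) - (A_{s-1} \gamma_{s-1} / \gamma_{s-1}' A_{s-1} \gamma_{s-1})$.


10.3.6. Gauss-Newton Method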


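The Gauss-Newton (GN) method is an iterative method for the NLS estimator that
can be implemented by iterative OLS.

Specifically, for NLS with conditional mean function $g(x_i, \beta)$, the GN method sets
the parameter change vector $(\hat{\beta}_{s+1} - \hat{\beta}_s)$ equal to the OLS coefficient estimates from
the artificial regression

$$y_i - g(x_i, \hat{\beta}_s) = \left. \frac{\partial g_i}{\partial \beta'} \right|_{\hat{\beta}_s} \beta + v_i. \tag{10.11}$$

Equivalently, $\hat{\beta}_{s+1}$ equals the OLS coefficient estimates from the artificial regression

$$y_i - g(x_i, \hat{\beta}_s) + \left. \frac{\partial g_i}{\partial \beta'} \right|_{\hat{\beta}_s} \hat{\beta}_s = \left. \frac{\partial g_i}{\partial \beta'} \right|_{\hat{\beta}_s} \beta + v_i. \tag{10.12}$$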
aP |p 38 Ia, 


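To derive this method, let $\hat{\beta}_s$ be a starting value, approximate $g(x_i, \beta)$ by a
first-order Taylor series expansion

$$g(x_i, \beta) = g(x_i, \hat{\beta}_s) + \left. \frac{\partial g_i}{\partial \beta'} \right|_{\hat{\beta}_s} (\beta - \hat{\beta}_s),$$

and substitute this in the least-squares objective function $Q_N(\beta)$ to obtain the
approximation

$$Q_N^*(\beta) = \sum_i \left( y_i - g(x_i, \hat{\beta}_s) - \left. \frac{\partial g_i}{\partial \beta'} \right|_{\hat{\beta}_s} (\beta - \hat{\beta}_s) \right)^2.$$

But this is the sum of squared residuals for OLS regression of $y_i - g(x_i, \hat{\beta}_s)$ on
$\partial g_i / \partial \beta' |_{\hat{\beta}_s}$ with parameter vector $(\beta - \hat{\beta}_s)$, leading to (10.11). More formally,

$$\hat{\beta}_{s+1} = \hat{\beta}_s + \left[ \sum_i \left. \frac{\partial g_i}{\partial \beta} \frac{\partial g_i}{\partial \beta'} \right|_{\hat{\beta}_s} \right]^{-1} \sum_i \left. \frac{\partial g_i}{\partial \beta} \right|_{\hat{\beta}_s} (y_i - g(x_i, \hat{\beta}_s)). \tag{10.13}$$

This is the gradient method (10.1) with vector $g_s = \sum_i \partial g_i / \partial \beta |_{\hat{\beta}_s} (y_i - g(x_i, \hat{\beta}_s))$
weighted by matrix $A_s = [\sum_i \partial g_i / \partial \beta \times \partial g_i / \partial \beta' |_{\hat{\beta}_s}]^{-1}$.

The iterative method (10.13) equals the method of scoring variant of the
Newton-Raphson algorithm for NLS estimation since, from Section 5.8, the second sum on the
right-hand side is the gradient vector and the first sum is minus the expected value
of the Hessian (see also Section 10.3.9). The Gauss-Newton algorithm is therefore a
special case of the Newton-Raphson, and NR is emphasized more here as it can be
applied to a much wider range of problems than can GN.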
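
The artificial regression (10.11) can be sketched for a single parameter, where the OLS coefficient reduces to a ratio of sums. The mean function $g(x, \beta) = x^{\beta}$ and the noise-free data generated with $\beta = 2$ are hypothetical, chosen so that the answer is known.

```python
import math

x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 4.0, 9.0, 16.0]   # hypothetical noise-free data generated with beta = 2

def gauss_newton(beta, tol=1e-12, max_iter=100):
    for s in range(1, max_iter + 1):
        resid = [yi - xi ** beta for xi, yi in zip(x, y)]   # y_i - g(x_i, beta_s)
        deriv = [xi ** beta * math.log(xi) for xi in x]     # dg_i/dbeta at beta_s
        # OLS without intercept of resid on deriv gives the parameter change.
        change = sum(d * r for d, r in zip(deriv, resid)) / sum(d * d for d in deriv)
        beta += change
        if abs(change) < tol:
            return beta, s
    raise RuntimeError("Gauss-Newton did not converge")

beta_hat, n_iter = gauss_newton(1.5)
print(beta_hat, n_iter)
```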


10.3.7. Expectation Maximization 


There are a number of data and model formulations considered in this book that can be
thought of as involving incomplete or missing data. For example, outcome variables of
interest (e.g., expenditure or the length of a spell in some state) may be right-censored.
That is, for some cases we may observe the actual expenditure or spell length, whereas
in other cases we may only know that the outcome exceeded some specific value, say
$c^*$. A second example involves a multiple regression in which the data matrix looks as
follows:

$$\begin{bmatrix} y_1 & X_1 \\ ? & X_2 \end{bmatrix},$$

where ? stands for missing data. Here we envisage a situation in which we wish to
estimate a linear regression model $y = X\beta + u$, where $y' = [y_1' \; ?]$, $X' = [X_1' \; X_2']$,
but a subset of variables $y$ is missing. A third example involves estimating the
parameters $(\theta_1, \theta_2, \ldots, \theta_C, \pi_1, \ldots, \pi_C)$ of a $C$-component mixture distribution, also called a
latent class model, $h(y|X) = \sum_{j=1}^{C} \pi_j f_j(y_j | X_j, \theta_j)$, where $f_j(y_j | X_j, \theta_j)$ are
well-defined pdfs. Here $\pi_j$ $(j = 1, \ldots, C)$ are unknown sampling fractions corresponding
to the $C$ latent densities from which the observations are sampled. It is convenient to
think of this problem also as a missing data problem in the sense that if the sampling
fractions were known constants then estimation would be simpler.

The expectation maximization (EM) framework provides a unifying framework 
for developing algorithms for problems that can be interpreted as involving miss- 
ing data. Although particular solutions to this type of estimation problem have long 
been found in the literature, Dempster, Laird, and Rubin (1977) provided a definitive 
treatment. 

Let $y$ denote the vector dependent variable of interest, determined by the
underlying latent variable vector $y^*$. Let $f^*(y^*|X, \theta)$ denote the joint density of the latent
variables, conditional on regressors $X$, and let $f(y|X, \theta)$ denote the joint density of
the observed variables. Let there be a many-to-one mapping from the sample space
of $y^*$ to that of $y$; that is, the value of the latent variable $y^*$ uniquely determines
$y$, but the value of $y$ does not uniquely determine $y^*$. It follows that $f(y|X, \theta) =
f^*(y^*|X, \theta) / f(y^*|y, X, \theta)$, since from Bayes rule the conditional density $f(y^*|y) =
f(y, y^*)/f(y) = f^*(y^*)/f(y)$, where the final equality uses $f(y^*, y) = f^*(y^*)$ as $y^*$
uniquely determines $y$. Rearranging gives $f(y) = f^*(y^*)/f(y^*|y)$.

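The MLE maximizes

$$Q_N(\theta) = \frac{1}{N} \mathcal{L}_N(\theta) = \frac{1}{N} \ln f^*(y^*|X, \theta) - \frac{1}{N} \ln f(y^*|y, X, \theta). \tag{10.14}$$

Because $y^*$ is unobserved, the first term in the log-likelihood cannot be evaluated directly.
It is replaced by its expected value conditional on the observed data, which will not
involve $y^*$, where at the $s$th round this expectation is evaluated at $\theta = \hat{\theta}_s$. The second
term is ignored.

The expectation (E) part of the EM algorithm calculates

$$Q_N(\theta | \hat{\theta}_s) = \mathrm{E}\left[ \frac{1}{N} \ln f^*(y^*|X, \theta) \,\Big|\, y, X, \hat{\theta}_s \right], \tag{10.15}$$

where expectation is with respect to the density $f(y^*|y, X, \hat{\theta}_s)$. The maximization (M)
part of the EM algorithm maximizes $Q_N(\theta | \hat{\theta}_s)$ to obtain $\hat{\theta}_{s+1}$.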

The full EM algorithm is iterative. The likelihood is maximized, given the expected
value of the latent variable; the expected value is evaluated afresh given the current
value of $\theta$. The iterative process continues until convergence is achieved. The EM
algorithm has the advantage of always leading to an increase or constancy in $Q_N(\theta)$;
see Amemiya (1985, p. 376). The EM algorithm is applied to a latent class model in
Section 18.5.3 and to missing data in Section 27.5.
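
As a minimal sketch of these iterations, consider right-censored durations that are exponentially distributed with rate `lam`. The model, the data, and the censoring point are hypothetical; the example is chosen because the direct MLE, the number of uncensored spells divided by total observed time, is available as a check.

```python
import math

# Hypothetical durations right-censored at c = 2.0; d = 1 if the spell is fully observed.
c = 2.0
y = [0.5, 1.2, 2.0, 2.0, 0.3, 2.0, 1.0]
d = [1, 1, 0, 0, 1, 0, 1]

def loglik(lam):
    """Observed-data log-likelihood: density if uncensored, survivor function if censored."""
    return sum(di * math.log(lam) - lam * yi for yi, di in zip(y, d))

def em(lam, tol=1e-12, max_iter=500):
    path = [loglik(lam)]
    for _ in range(max_iter):
        # E step: expected latent duration is y if uncensored, c + 1/lam if censored.
        filled = [yi if di else c + 1.0 / lam for yi, di in zip(y, d)]
        # M step: complete-data MLE of the rate using the filled-in durations.
        new = len(filled) / sum(filled)
        path.append(loglik(new))
        if abs(new - lam) < tol:
            return new, path
        lam = new
    raise RuntimeError("EM did not converge")

lam_hat, path = em(1.0)
print(lam_hat, sum(d) / sum(y), len(path) - 1)
```

The recorded `path` of log-likelihood values never decreases, and the limit coincides with the direct MLE, although it takes a few dozen rounds to get there.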

There is a very extensive literature on situations where the EM algorithm can be 
usefully applied, even though it can be applied to only a subset of optimization prob- 
lems. The EM algorithm is easy to program in many cases and its use was further en- 
couraged by considerations of limited computing power and storage that are no longer 
paramount. Despite these attractions, for censored data models and latent class models 
direct estimation using Newton-Raphson type iterative procedures is often found to be
faster and more efficient computationally. 


10.3.8. Simulated Annealing 


Simulated annealing (SA) is a nongradient iterative method reviewed by Goffe, 
Ferrier, and Rogers (1994). It differs from gradient methods in permitting movements 
that decrease rather than increase the objective function to be maximized, so that one 
is not locked in to moving steadily toward one particular local maximum. 

Given a value $\hat{\theta}_s$ at the $s$th round we perturb the $j$th component of $\hat{\theta}_s$ to obtain a
new trial value of

$$\theta_s^* = \hat{\theta}_s + [0 \cdots 0 \; (\lambda_j r_j) \; 0 \cdots 0]', \tag{10.16}$$

where $\lambda_j$ is a prespecified step length and $r_j$ is a draw from a uniform distribution on
$(-1, 1)$. The new trial value is used, that is, the method sets $\hat{\theta}_{s+1} = \theta_s^*$, if it increases
the objective function, or if it does not increase the value of the objective function but
does pass the Metropolis criterion that

$$\exp\left( (Q_N(\theta_s^*) - Q_N(\hat{\theta}_s)) / T_s \right) > u, \tag{10.17}$$

where $u$ is a drawing from a uniform $(0, 1)$ distribution and $T_s$ is a scaling parameter
called the temperature. Thus not only uphill moves are accepted, but downhill moves
are also accepted with a probability that decreases with the difference between $Q_N(\theta_s^*)$
and $Q_N(\hat{\theta}_s)$ and that increases with the temperature. The terms simulated annealing
and temperature come from analogy with minimizing thermal energy by slowly cooling
(annealing) a molten metal.

The user needs to set the step-size parameter $\lambda_j$. Goffe et al. (1994) suggest
periodically adjusting $\lambda_j$ so that 50% of all moves over a number of iterations are accepted.
The temperature also needs to be chosen and reduced during the course of iterations.
Then the algorithm initially is searching over a wide range of parameter values before
steadily locking in on a particular region.
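
The perturbation (10.16) and the Metropolis criterion (10.17) can be sketched as follows. The multimodal objective, the common step length, the geometric cooling rate of 0.95, and the number of cycles are all hypothetical choices made for illustration; the periodic step-length adjustment of Goffe et al. (1994) is not implemented.

```python
import math
import random

def Q(theta):
    # Hypothetical objective with many local maxima; the global maximum is Q(0, 0) = 2.
    return sum(math.cos(3.0 * t) - 0.1 * t * t for t in theta)

def simulated_annealing(theta, step=1.0, T0=1.0, cool=0.95, cycles=200, seed=0):
    rng = random.Random(seed)
    theta = list(theta)
    best = list(theta)
    T = T0
    for _ in range(cycles):
        for j in range(len(theta)):                  # one cycle over the q components
            trial = list(theta)
            trial[j] += step * rng.uniform(-1.0, 1.0)    # perturb component j, as in (10.16)
            diff = Q(trial) - Q(theta)
            # Accept uphill moves, and downhill moves that pass the Metropolis test (10.17).
            if diff > 0 or math.exp(diff / T) > rng.random():
                theta = trial
                if Q(theta) > Q(best):
                    best = list(theta)
        T *= cool                                    # assumed geometric cooling schedule
    return best

start = [4.0, -4.0]
best = simulated_annealing(start)
print(best, Q(best))
```

Keeping track of the best point visited is a convenience of this sketch, since the current point may move downhill.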

Fast simulated annealing (FSA), proposed by Szu and Hartley (1987), is a faster
method. It replaces the uniform $(-1, 1)$ random number $r_j$ by a Cauchy random
variable $r_j$ scaled by the temperature and permits a fixed step length $v_j$. The method also
uses a simpler adjustment of the temperature over iterations with $T_s$ equal to the
initial temperature divided by the number of FSA iterations, where one iteration is a full
cycle over the $q$ components of $\theta$.

Cameron and Johansson (1997) discuss and use simulated annealing, following the
methods of Horowitz (1992). This begins with FSA but on grounds of computational
savings switches to gradient methods (BFGS) when relatively little change in $Q_N(\cdot)$
occurs over a number of iterations or after many (250) FSA iterations. In a simulation
they find that NR with a number of different starting values offers a considerable
improvement over NR with just one set of starting values, but even better is FSA with a
number of different starting values.


10.3.9. Example: Exponential Regression 


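Consider the nonlinear regression model with exponential conditional mean

$$\mathrm{E}[y_i | x_i] = \exp(x_i'\beta), \tag{10.18}$$

where $x_i$ and $\beta$ are $K \times 1$ vectors. The NLS estimator $\hat{\beta}$ minimizes

$$Q_N(\beta) = \sum_i (y_i - \exp(x_i'\beta))^2, \tag{10.19}$$

where for notational simplicity scaling by 2/N is ignored. The first-order conditions
are nonlinear in $\beta$ and there is no explicit solution for $\hat{\beta}$. Instead, gradient methods
need to be used.

For this example the gradient and Hessian are, respectively,

$$g = -2 \sum_i (y_i - e^{x_i'\beta}) e^{x_i'\beta} x_i, \tag{10.20}$$

and

$$H = 2 \sum_i \left[ e^{x_i'\beta} e^{x_i'\beta} x_i x_i' - (y_i - e^{x_i'\beta}) e^{x_i'\beta} x_i x_i' \right]. \tag{10.21}$$

The NR iterative method (10.5) uses $g_s$ and $H_s$ equal to (10.20) and (10.21) evaluated
at $\hat{\beta}_s$.

A simpler method of scoring variation of NR notes that (10.18) implies

$$\mathrm{E}[H] = 2 \sum_i e^{x_i'\beta} e^{x_i'\beta} x_i x_i'. \tag{10.22}$$

Using $\mathrm{E}[H_s]$ in place of $H_s$ yields

$$\hat{\beta}_{s+1} = \hat{\beta}_s + \left[ \sum_i e^{x_i'\hat{\beta}_s} e^{x_i'\hat{\beta}_s} x_i x_i' \right]^{-1} \sum_i e^{x_i'\hat{\beta}_s} x_i (y_i - e^{x_i'\hat{\beta}_s}).$$

It follows that $\hat{\beta}_{s+1} - \hat{\beta}_s$ can be computed from OLS regression of $(y_i - e^{x_i'\hat{\beta}_s})$ on
$e^{x_i'\hat{\beta}_s} x_i$. This is also the Gauss-Newton regression (10.11), since $\partial g(x_i, \beta)/\partial \beta =
\exp(x_i'\beta) x_i$ for the exponential conditional mean (10.18). Specialization to
$\exp(x_i'\beta) = \exp(\beta)$ gives the iterative procedure presented in Section 10.2.4.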
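
The method of scoring update for this example can be sketched with $K = 2$ (an intercept and one regressor), where the $2 \times 2$ system is solved by Cramer's rule. The data are hypothetical and noise-free, generated with $\beta = (0.5, -1.0)$, so the iterations should recover those values.

```python
import math

x = [0.0, 0.5, 1.0, 1.5, 2.0]
y = [math.exp(0.5 - xi) for xi in x]   # hypothetical noise-free data, beta = (0.5, -1.0)

def scoring_step(b0, b1):
    """One step: beta + [sum m^2 x x']^{-1} sum m x (y - m), with regressors (1, x_i)."""
    s00 = s01 = s11 = g0 = g1 = 0.0
    for xi, yi in zip(x, y):
        m = math.exp(b0 + b1 * xi)     # fitted conditional mean
        s00 += m * m
        s01 += m * m * xi
        s11 += m * m * xi * xi
        g0 += m * (yi - m)
        g1 += m * xi * (yi - m)
    det = s00 * s11 - s01 * s01
    d0 = (s11 * g0 - s01 * g1) / det   # Cramer's rule for the 2x2 system
    d1 = (s00 * g1 - s01 * g0) / det
    return b0 + d0, b1 + d1, max(abs(d0), abs(d1))

def fit(b0=0.0, b1=0.0, tol=1e-12, max_iter=100):
    for s in range(1, max_iter + 1):
        b0, b1, change = scoring_step(b0, b1)
        if change < tol:
            return b0, b1, s
    raise RuntimeError("method of scoring did not converge")

b0_hat, b1_hat, n_iter = fit()
print(b0_hat, b1_hat, n_iter)
```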


10.4. Practical Considerations 


Some practical issues have already been presented in Section 10.2, notably conver- 
gence criteria, modifications such as step-size adjustment, and the use of numerical 
rather than analytical derivatives. In this section a brief overview of statistical packages
is given, followed by a discussion of common pitfalls that can arise in computation of
a nonlinear estimator.


10.4.1. Statistical Packages 


All standard microeconometric packages such as Limdep, Stata, PCTSP, and SAS have 
built-in procedures to estimate basic nonlinear models such as logit and probit. These 
packages are simple to use, requiring no knowledge of iterative methods or even of the 
model being used. For example, the command for logit regression might be “logit y 
x” rather than the command “ols y x” for OLS. Nonlinear least squares requires some 
code to convey to the package the particular functional form for $g(x, \beta)$ one wishes
to specify. Estimation should be quick and accurate as the program should exploit the 
structure of the particular model. For example, if the objective function is globally 
concave then the method of scoring might be used. 

If a statistical package does not contain a particular model then one needs to write 
one’s own code. This situation can arise with even minor variation of standard mod- 
els, such as imposing restrictions on parameters or using parameterizations that are 
not of single-index form. The code may be written using one’s own favorite statistical 
package or using other more specialized programming languages. Possibilities include 
(1) built-in optimization procedures within the statistical package that require spec- 
ification of the objective function and possibly its derivatives; (2) matrix commands 
within the statistical package to compute $A_s$ and $g_s$ and iterate; (3) a matrix programming
language such as Gauss, Matlab, OX, SAS/IML, or S-Plus, and possibly add-on
optimization routines; (4) a programming language such as Fortran or C++; and (5) an 
optimization package such as those in GAMS, GQOPT, or NAGLIB. 

The first and second methods are attractive because they do not force the user to 
learn a new program. The first method is particularly simple for m-estimation as it can 
require merely specification of the subfunction $q_i(\theta)$ for the $i$th observation rather than
specification of $Q_N(\theta)$. In practice, however, the optimization procedures for
user-defined functions in the standard packages are more likely to encounter numerical
problems than if more specialized programs are used. Moreover, for some packages 
the second method can require learning arcane forms of matrix commands. 

For nonlinear problems, the third method is the best, although this might require the 
user to learn a matrix programming language from scratch. One then is set up to han- 
dle virtually any econometric problem encountered, and the optimization routines that 
come with matrix programming languages are usually adequate. Also, many authors 
make available the code used in specific papers. 

The fourth and fifth methods generally require a higher level of programming so- 
phistication than the third method. The fourth method can lead to much faster compu- 
tation and the fifth method can solve the most numerically challenging optimization 
problems. 

Other practical issues include cost of software; the software used by colleagues; and 
whether the software has clear error messages and useful debugging features, such as a 
trace program that tracks line-by-line program execution. The value of using software 
similar to that used by colleagues should not be underestimated.

Table 10.2. Computational Difficulties: A Partial Checklist

Problem                        Check
-----------------------------  ------------------------------------------------------------
Data read incorrectly          Print full descriptive statistics.
Imprecise calculation          Use analytical derivatives or numerical with different step size h.
Multicollinearity              Check condition number of X'X. Try subset of regressors.
Singular matrix in iterations  Try method not requiring matrix inversion such as DFP.
Poor starting values           Try a range of different starting values.
Model not identified           Difficult to check. Obvious are dummy variable traps.
Strange parameter values       Constant included/excluded? Iterations actually converged?
Different standard errors      Which method was used to calculate variance matrix?


10.4.2. Computational Difficulties 


Computational difficulties are, in practice, situations where it is not possible to obtain 
an estimate of the parameters. For example, an error message may indicate that the 
estimator cannot be calculated because the Hessian is singular. There are many possi- 
ble reasons for this, as detailed in the following and summarized in Table 10.2. These 
reasons may also provide explanation for another common situation of parameter esti- 
mates that are obtained but are seemingly in error. 

First, the data may not have been read in correctly. This is a remarkably common 
oversight. With large data sets it is not practical to print out all the data. However, at a 
minimum one should always obtain descriptive statistics and check for anomalies such
as incorrect range for a variable, unusually large or small sample mean, and unusu- 
ally large or small standard deviation (including a value of zero, which indicates no 
variation). See Section 3.5.4 for further details. 

Second, there may be calculation errors. To minimize these all calculations should
be done in double precision or even quadruple precision rather than single precision.
It is helpful to rescale the data so that the regressors have similar means and variances.
For example, it may be better to use annual income in thousands of dollars rather than
in dollars. If numerical derivatives are used it may be necessary to alter the change
value $h$ in (10.4). Care needs to be paid to how functions are evaluated. For example,
the function $\ln \Gamma(y)$, where $\Gamma(\cdot)$ is the gamma function, is best evaluated using the
log-gamma function. If instead one evaluates the gamma function followed by the log
function considerable numerical error arises even for moderate sized $y$.
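
The point about the log-gamma function can be illustrated with Python's standard `math` module, where `math.gamma` and `math.lgamma` are the gamma and log-gamma functions. The wrapper names below are hypothetical. In double precision the two-step evaluation fails outright once $\Gamma(y)$ exceeds the largest representable number, which happens for $y$ a little above 171.

```python
import math

def log_gamma_naive(y):
    # Gamma function followed by the log function; overflows for large y.
    return math.log(math.gamma(y))

def log_gamma_stable(y):
    # Direct evaluation of ln Gamma(y).
    return math.lgamma(y)

print(log_gamma_naive(10.0), log_gamma_stable(10.0))

try:
    log_gamma_naive(200.0)
except OverflowError as err:
    print("naive evaluation failed:", err)

print(log_gamma_stable(200.0))
```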

Third, multicollinearity may be a problem. In single-index models (see Sec- 
tion 5.2.4) the usual checks for multicollinearity will carry over. The correlation matrix 
for the regressors can be printed, though this only considers pairwise correlation. Bet- 
ter is to use the condition number of X’X, that is, the square root of the ratio of the 
largest to smallest eigenvalue of X’X. If this exceeds 100 then problems may arise. For 
more highly nonlinear models than single-index ones it is possible to have problems 
even if the condition number is not large. If one suspects multicollinearity is causing
numerical problems then see whether it is possible to estimate the model with a subset
of the variables that are less likely to be collinear. 
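
For two regressors the condition number defined above can be computed in closed form from the eigenvalues of the symmetric $2 \times 2$ matrix $X'X$. The sketch below makes that restriction, and the data are hypothetical; with more regressors an eigenvalue routine from a matrix language would be needed.

```python
import math

def condition_number_2(x1, x2):
    """Square root of the ratio of largest to smallest eigenvalue of X'X, X = [x1 x2]."""
    a = sum(v * v for v in x1)
    b = sum(u * v for u, v in zip(x1, x2))
    c = sum(v * v for v in x2)
    mid = 0.5 * (a + c)
    rad = math.sqrt((0.5 * (a - c)) ** 2 + b * b)
    # mid - rad is the smallest eigenvalue; it is zero if x1 and x2 are exactly collinear.
    return math.sqrt((mid + rad) / (mid - rad))

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.01, 2.0, 2.99, 4.02, 4.98]   # hypothetical regressor, nearly equal to x1

print(condition_number_2([1.0, 0.0], [0.0, 1.0]))   # orthogonal columns
print(condition_number_2(x1, x2))                   # nearly collinear columns
```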

Fourth, a noninvertible Hessian during iterations does not necessarily imply singu- 
larity at the true maximum. It is worthwhile trying a range of iterative methods such 
as steepest ascent with line search and DFP, not just Newton-Raphson. This problem
may also result from multicollinearity. 

Fifth, try different starting values. The iterative gradient methods are designed to 
obtain a local maximum rather than the global maximum. One way to guard against 
this is to begin iterations at a wide range of starting values. A second way is to per- 
form a grid search. Both of these approaches theoretically require evaluations at many 
different points if the dimension of $\theta$ is large, but it may be sufficient to do a detailed
analysis for a stripped-down version of the model that includes just the few regressors 
thought to be most statistically significant. 

Lastly, the model may not be identified. Indeed a standard necessary condition for 
model identification is that the Hessian be invertible. As with linear models, sim- 
ple checks include avoiding dummy variable traps and, if a subset of data is being 
used in initial analysis, determining that all variables in the subset of the data have 
some variation. For example, if data are ordered by gender or by age or by region 
then problems can arise if these appear as indicator variables and the chosen subset 
is of individuals of a particular gender, age, or region. For nonlinear models it can 
be difficult to theoretically determine that the model is not identified. Often one first 
eliminates all other potential causes before returning to a careful analysis of model 
identification. 

Even after parameter estimates are successfully obtained computational problems 
can still arise, as it may not be possible to obtain estimates of the variance matrix
$A^{-1} B A^{-1}$. This situation can arise when the iterative method used, such as DFP, does
not use the inverse Hessian matrix $A^{-1}$ as the weighting matrix in the iterations. First check
that the iterative method has indeed converged rather than, for example, stopping at 
a default maximum number of iterations. If convergence has occurred, try alternative 
estimates of A, using the expected Hessian or using more accurate numerical com- 
putations by, for example, using analytical rather than numerical derivatives. If such 
solutions still fail it is possible that the model is not identified, with this nonidentifica- 
tion being finessed at the parameter estimation stage by using an iterative method that 
did not compute the Hessian. 

Other perceived computational problems are parameter and variance estimates that 
do not accord with prior beliefs. For parameter estimates obvious checks include en- 
suring correct treatment of an intercept term (inclusion or exclusion, depending on the 
context), that convergence has been achieved, and that a global maximum is obtained 
(by trying a range of starting values). If standard errors of parameter estimates dif- 
fer across statistical packages that give the same parameter estimates, the most likely 
cause is that a different method has been used to construct the variance matrix estimate 
(see Section 5.5.2). 

A good computational strategy is to start with a small subset of the data and regres- 
sors, say one regressor and 100 observations. This simplifies detailed tracing of the 
program either manually, such as by printing out key output along the way, or using
a built-in trace facility if the program has one. If the program passes this test then
computational problems with the full model and data are less likely to be due to in- 
correct data input or coding errors and are more likely due to genuine computational 
difficulties such as multicollinearity or poor starting values. 

A good way to test program validity is to construct a simulated data set where the 
true parameters are known. For a large sample size, say N = 10,000, the estimated 
parameter values should be close to the true values. 
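
Such a check can be sketched as follows for NLS with a scalar exponential mean. The data generating process (uniform regressor, uniform additive error, true $\beta = 0.5$), the seed, and the use of Gauss-Newton iterations are assumptions of this sketch.

```python
import math
import random

def simulate(n, beta, seed=12345):
    """Draw n observations with E[y|x] = exp(beta * x) and a mean-zero uniform error."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [math.exp(beta * xi) + rng.uniform(-0.5, 0.5) for xi in xs]
    return xs, ys

def nls_exponential(xs, ys, beta=0.0, tol=1e-10, max_iter=100):
    """Gauss-Newton iterations for the scalar model E[y|x] = exp(beta * x)."""
    for _ in range(max_iter):
        num = den = 0.0
        for xi, yi in zip(xs, ys):
            m = math.exp(beta * xi)
            num += m * xi * (yi - m)
            den += (m * xi) ** 2
        change = num / den
        beta += change
        if abs(change) < tol:
            return beta
    raise RuntimeError("NLS iterations did not converge")

true_beta = 0.5
xs, ys = simulate(10_000, true_beta)
beta_hat = nls_exponential(xs, ys)
print(true_beta, beta_hat)
```

An estimate far from the true value in a sample this large would point to a coding error rather than to sampling variation.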

Finally, note that obtaining reasonable computational results from estimation of a 
nonlinear model does not guarantee correct results. For example, many early pub- 
lished applications of multinomial probit models reported apparently sensible results, 
yet the models estimated have subsequently been determined to be not identified (see 
Section 15.8.1). 


10.5. Bibliographic Notes 


Numerical problems can arise even in linear models, and it is instructive to read Davidson and 
MacKinnon (1993, Section 1.5) and Greene (2003, appendix E). Standard references for statis- 
tical computation are Kennedy and Gentle (1980) and especially Press et al. (1993) and related 
co-authored books by Press. For evaluation of functions the standard reference is Abramowitz 
and Stegun (1971). Quandt (1983) presents many computational issues, including optimization. 


5.3 Summaries of iterative methods are given in Amemiya (1985, Section 4.4), Davidson and 
MacKinnon (1993, Section 6.7), Maddala (1977, Section 9.8), and especially Greene (2003, 
appendix E.6). Harvey (1990) gives many applications of the GN algorithm, which, owing 
to its simplicity, is the usual iterative method for NLS estimation. For the EM algorithm see 
especially Amemiya (1985, pp. 375-378). For SA see Goffe et al. (1994). 


Exercises 


10-1 Consider calculation of the MLE in the logit regression model when the only
regressor is the intercept. Then $E[y] = 1/(1 + e^{-\beta})$ and the gradient of the scaled
log-likelihood function is $g(\beta) = (\bar{y} - 1/(1 + e^{-\beta}))$. Suppose a sample yields
$\bar{y} = 0.8$ and the starting value is $\beta = 0.0$.

(a) Calculate $\hat{\beta}$ for the first six iterations of the Newton-Raphson algorithm.
(b) Calculate the first six iterations of a gradient algorithm that sets $A_s = 1$ in
(10.1), so $\hat{\beta}_{s+1} = \hat{\beta}_s + g_s$.
(c) Compare the performance of the methods in parts (a) and (b). 


10-2 Consider the nonlinear regression model $y = \alpha x_1 + \gamma/(x_2 - \delta) + u$, where $x_1$
and $x_2$ are exogenous regressors independent of the iid error $u \sim N[0, \sigma^2]$.
(a) Derive the equation for the Gauss-Newton algorithm for estimating $(\alpha, \gamma, \delta)$.
(b) Derive the equation for the Newton-Raphson algorithm for estimating
$(\alpha, \gamma, \delta)$.
(c) Explain the importance of not arbitrarily choosing the starting values of the 
algorithm. 


10-3 Suppose that the pdf of $y$ has a $C$-component mixture form, $f(y|\pi) =
\sum_{j=1}^{C} \pi_j f_j(y)$, where $\pi = (\pi_1, \ldots, \pi_C)$, $\pi_j > 0$, $\sum_{j=1}^{C} \pi_j = 1$. The $\pi_j$ are
unknown mixing proportions whereas the parameters of the densities $f_j(y)$ are
presumed known.

(a) Given a random sample on $y_i$, $i = 1, \ldots, N$, write the general log-likelihood
function and obtain the first-order conditions for $\hat{\pi}_{ML}$. Verify that there is no
explicit solution for $\hat{\pi}_{ML}$.

(b) Let $z_i$ be a $C \times 1$ vector of latent categorical variables, $i = 1, \ldots, N$, such
that $z_{ji} = 1$ if $y_i$ comes from the $j$th component of the mixture and $z_{ji} = 0$
otherwise. Write down the likelihood function in terms of the observed and
latent variables as if the latent variable were observed.

(c) Devise an EM algorithm for estimating $\pi$. [Hint: If $z_{ji}$ were observable the
MLE of $\pi_j$ would be $N^{-1} \sum_i z_{ji}$. The E step requires calculation of $E[z_{ji}|y_i]$; the M
step requires replacing $z_{ji}$ by $E[z_{ji}|y_i]$ and then solving for $\pi$.]

10-4 Let $(y_{1i}, y_{2i})$, $i = 1, \ldots, N$, have a bivariate normal distribution with mean
$(\mu_1, \mu_2)$ and covariance parameters $(\sigma_{11}, \sigma_{12}, \sigma_{22})$ and correlation coefficient
$\rho$. Suppose that all $N$ observations on $y_1$ are available but there are $m < N$
missing observations on $y_2$. Using the fact that the marginal distribution of $y_j$
is $N[\mu_j, \sigma_{jj}]$, and that conditionally $y_2|y_1 \sim N[\mu_{2.1}, \sigma_{22.1}]$, where $\mu_{2.1} = \mu_2 +
(\sigma_{12}/\sigma_{11})(y_1 - \mu_1)$ and $\sigma_{22.1} = (1 - \rho^2)\sigma_{22}$, devise an EM algorithm for imputing the
missing observations on $y_2$.

Simulation-Based Methods


Part 1 emphasized that microeconometric models are frequently nonlinear models es- 
timated using large and heterogeneous data sets drawn from surveys that are complex 
and subject to a variety of sampling biases. A realistic depiction of the economic phe- 
nomena in such settings often requires the use of models for which estimation and 
subsequent statistical inference are difficult. Advances in computing hardware and 
software now make it feasible to tackle such tasks. Part 3 presents modern, computer- 
intensive, simulation-based methods of estimation and inference that mitigate some of 
these difficulties. The background required to cover this material varies somewhat with 
the chapter, but the essential base is least squares and maximum likelihood estimation. 

Chapter 11 presents bootstrap methods for statistical inference. These methods have 
the attraction of providing a simple way to obtain standard errors when the formulae 
from asymptotic theory are complex, as is the case, for example, for some two-step 
estimators. Furthermore, if implemented appropriately, a bootstrap can lead to a more 
refined asymptotic theory that may then lead to better statistical inference in small 
samples. 

Chapter 12 presents simulation-based estimation methods. These methods permit 
estimation in situations where standard computational methods may not permit calcu- 
lation of an estimator, because of the presence of an integral over a probability distri- 
bution that leads to no closed-form solution. 

Chapter 13 surveys Bayesian methods that provide an approach to estimation and 
inference that is quite different from the classical approach used in other chapters 
of this book. Despite this different approach, in practice in large sample settings the 
Bayesian approach produces similar results to those from classical methods. Further, 
it often does so in a computationally more efficient manner.

Bootstrap Methods


11.1. Introduction 


Exact finite-sample results are unavailable for most microeconometrics estimators 
and related test statistics. The statistical inference methods presented in preceding 
chapters rely on asymptotic theory that usually leads to limit normal and chi-square 
distributions. 

An alternative approximation is provided by the bootstrap, due to Efron (1979, 
1982). This approximates the distribution of a statistic by a Monte Carlo simulation, 
with sampling done from the empirical distribution or the fitted distribution of the ob- 
served data. The additional computation required is usually feasible given advances 
in computing power. Like conventional methods, however, bootstrap methods rely on 
asymptotic theory and are only exact in infinitely large samples. 

The wide range of bootstrap methods can be classified into two broad approaches. 
First, the simplest bootstrap methods can permit statistical inference when conven- 
tional methods such as standard error computation are difficult to implement. Second, 
more complicated bootstraps can have the additional advantage of providing asymptotic
refinements that can lead to a better approximation in finite samples. Applied
researchers are most often interested in the first aspect of the bootstrap. Theoreticians 
emphasize the second, especially in settings where the usual asymptotic methods work 
poorly in finite samples. 

The econometrics literature focuses on use of the bootstrap in hypothesis test- 
ing, which relies on approximation of probabilities in the tails of the distributions 
of statistics. Other applications are to confidence intervals, estimation of standard er- 
rors, and bias reduction. The bootstrap is straightforward to implement for smooth 
$\sqrt{N}$-consistent estimators based on iid samples, though bootstraps with asymptotic
refinements are underutilized. Caution is needed in other settings, including nonsmooth
estimators such as the median, nonparametric estimators, and inference for data that 
are not iid. 

A reasonably self-contained summary of the bootstrap is provided in Section 11.2, 
an example is given in Section 11.3, and some theory is provided in Section 11.4. 
Further variations of the bootstrap are presented in Section 11.5. Section 11.6 presents
use of the bootstrap for specific types of data and specific methods used often in 
microeconometrics. 


11.2. Bootstrap Summary 


We summarize key bootstrap methods for an estimator $\hat{\theta}$ and associated statistics based
on an iid sample $\{w_1, \ldots, w_N\}$, where usually $w_i = (y_i, x_i)$ and $\hat{\theta}$ is a smooth
estimator that is $\sqrt{N}$-consistent and asymptotically normally distributed. For notational
simplicity we generally present results for scalar $\theta$. For vector $\theta$ in most instances
replace $\theta$ by $\theta_j$, the $j$th component of $\theta$.

Statistics of interest include the usual regression output: the estimate $\hat{\theta}$; standard
errors $s_{\hat{\theta}}$; the t-statistic $t = (\hat{\theta} - \theta_0)/s_{\hat{\theta}}$, where $\theta_0$ is the null hypothesis value; the
associated critical value or p-value for this statistic; and a confidence interval.

This section presents bootstraps for each of these statistics. Some motivation is also 
provided, with the underlying theory sketched in Section 11.4. 


11.2.1. Bootstrap without Refinement 


Consider estimation of the variance of the sample mean $\hat{\mu} = \bar{y} = N^{-1} \sum_{i=1}^{N} y_i$, where
the scalar random variable $y_i$ is iid $[\mu, \sigma^2]$, when it is not known that $V[\hat{\mu}] = \sigma^2/N$.

The variance of $\hat{\mu}$ could be obtained by drawing $S$ such samples of size $N$ from the
population, leading to $S$ sample means and hence $S$ estimates $\hat{\mu}_s = \bar{y}_s$, $s = 1, \ldots, S$.
Then we could estimate $V[\hat{\mu}]$ by $(S - 1)^{-1} \sum_{s=1}^{S} (\hat{\mu}_s - \bar{\hat{\mu}})^2$, where $\bar{\hat{\mu}} = S^{-1} \sum_{s=1}^{S} \hat{\mu}_s$.

Of course this approach is not possible, as we only have one sample. A bootstrap
can implement this approach by viewing the sample as the population. Then the finite
population is now the actual data $y_1, \ldots, y_N$. The distribution of $\hat{\mu}$ can be obtained
by drawing $B$ bootstrap samples from this population of size $N$, where each bootstrap
sample of size $N$ is obtained by sampling from $y_1, \ldots, y_N$ with replacement. This
leads to $B$ sample means and hence $B$ estimates $\hat{\mu}_b = \bar{y}_b$, $b = 1, \ldots, B$. Then
estimate $V[\hat{\mu}]$ by $(B - 1)^{-1} \sum_{b=1}^{B} (\hat{\mu}_b - \bar{\hat{\mu}})^2$, where $\bar{\hat{\mu}} = B^{-1} \sum_{b=1}^{B} \hat{\mu}_b$. Sampling with
replacement may seem to be a departure from usual sampling methods, but in fact
standard sampling theory assumes sampling with replacement rather than without
replacement (see Section 24.2.2).
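
The resampling scheme just described can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a reference implementation: the function name `bootstrap_variance_of_mean`, the simulated exponential data, the sample size of 200, and the choice of 1,000 resamples are all hypothetical.

```python
import random
import statistics

def bootstrap_variance_of_mean(y, B, rng):
    """Estimate V[sample mean] from B resamples of y drawn with replacement."""
    n = len(y)
    boot_means = []
    for _ in range(B):
        resample = rng.choices(y, k=n)          # one bootstrap sample of size N
        boot_means.append(sum(resample) / n)    # its sample mean
    return statistics.variance(boot_means)      # uses the (B - 1) divisor

rng = random.Random(0)                          # fixed seed for reproducibility
y = [rng.expovariate(1.0) for _ in range(200)]  # illustrative data, N = 200
v_boot = bootstrap_variance_of_mean(y, B=1000, rng=rng)
v_formula = statistics.variance(y) / len(y)     # s^2 / N, for comparison
print(v_boot, v_formula)
```

For the sample mean the analytical answer is known, so the two printed numbers should be similar; the bootstrap becomes useful when no such formula is at hand.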

With additional information other ways to obtain bootstrap samples may be possible.
For example, if it is known that $y_i \sim N[\mu, \sigma^2]$ then we could obtain $B$ bootstrap
samples of size $N$ by drawing from the $N[\hat{\mu}, s^2]$ distribution. This bootstrap is an
example of a parametric bootstrap, whereas the preceding bootstrap was from the
empirical distribution.
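
A parametric version of the same calculation might look as follows. The sketch assumes normality is known, as in the example above; the function name, the simulated data, and the value of B are hypothetical.

```python
import random
import statistics

def parametric_bootstrap_variance_of_mean(y, B, rng):
    """Estimate V[sample mean] by drawing samples from the fitted normal."""
    n = len(y)
    mu_hat = statistics.mean(y)   # fitted mean
    s = statistics.stdev(y)       # fitted standard deviation
    boot_means = []
    for _ in range(B):
        draws = [rng.gauss(mu_hat, s) for _ in range(n)]  # sample from N[mu_hat, s^2]
        boot_means.append(sum(draws) / n)
    return statistics.variance(boot_means)

rng = random.Random(1)
y = [rng.gauss(5.0, 2.0) for _ in range(100)]   # illustrative normal data
v_param = parametric_bootstrap_variance_of_mean(y, B=1000, rng=rng)
v_formula = statistics.variance(y) / len(y)     # s^2 / N, for comparison
print(v_param, v_formula)
```

The only change from the empirical-distribution bootstrap is the source of the draws: the fitted distribution replaces resampling of the observed data.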

More generally, for estimator $\hat{\theta}$ similar bootstraps can be used to, for example,
estimate $V[\hat{\theta}]$ and hence standard errors when analytical formulas for $V[\hat{\theta}]$ are
complex. Such bootstraps are usually valid for observations $w_i$ that are iid over $i$, and they
have similar properties to estimates obtained using the usual asymptotic theory.

11.2.2. Asymptotic Refinements


In some settings it is possible to improve on the preceding bootstrap and obtain es- 
timates that are equivalent to those obtained using a more refined asymptotic theory 
that may better approximate the finite-sample distribution of $\hat{\theta}$. Much of this chapter
is directed to such asymptotic refinements. 


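Usual asymptotic theory uses the result that $\sqrt{N}(\hat{\theta} - \theta_0) \xrightarrow{d} N[0, \sigma^2]$. Thus

    $\Pr[\sqrt{N}(\hat{\theta} - \theta_0)/\sigma \le z] = \Phi(z) + R_1,$    (11.1)

where $\Phi(\cdot)$ is the standard normal cdf and $R_1$ is a remainder term that disappears as
$N \to \infty$.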

This result is based on asymptotic theory detailed in Section 5.3 that includes ap- 
plication of a central limit theorem. The CLT is based on a truncated power-series 
expansion. The Edgeworth expansion, detailed in Section 11.4.3, includes additional 
terms in the expansion. With one extra term this yields 


    $\Pr[\sqrt{N}(\hat{\theta} - \theta_0)/\sigma \le z] = \Phi(z) + \frac{g_1(z)\phi(z)}{\sqrt{N}} + R_2,$    (11.2)

where $\phi(\cdot)$ is the standard normal density, $g_1(\cdot)$ is a bounded function given after
(11.13) in Section 11.4.3, and $R_2$ is a remainder term that disappears as $N \to \infty$.

The Edgeworth expansion is difficult to implement theoretically as the function 
$g_1(\cdot)$ is data dependent in a complicated way. A bootstrap with asymptotic refinement
provides a simple computational method to implement the Edgeworth expansion. The 
theory is given in Section 11.4.4. 

Since $R_1 = O(N^{-1/2})$ and $R_2 = O(N^{-1})$, asymptotically $R_2 < R_1$, leading to a
better approximation as $N \to \infty$. However, in finite samples it is possible that $R_2 >
R_1$. A bootstrap with asymptotic refinement provides a better approximation asymptotically
that hopefully leads to a better approximation in samples of the finite sizes typically
used. Nevertheless, there is no guarantee, and simulation studies are frequently
used to verify that finite-sample gains do indeed occur.


11.2.3. Asymptotically Pivotal Statistic 


For asymptotic refinement to occur, the statistic being bootstrapped must be an asymp- 
totically pivotal statistic, meaning a statistic whose limit distribution does not depend 
on unknown parameters. This result is explained in Section 11.4.4. 

As an example, consider sampling from $y_i \sim [\mu, \sigma^2]$. Then the estimate $\hat{\mu} = \bar{y} \stackrel{a}{\sim}
N[\mu, \sigma^2/N]$ is not asymptotically pivotal even given a null hypothesis value $\mu = \mu_0$
since its distribution depends on the unknown parameter $\sigma^2$. However, the studentized
statistic $t = (\hat{\mu} - \mu_0)/s_{\hat{\mu}} \stackrel{a}{\sim} N[0, 1]$ is asymptotically pivotal.

Estimators are usually not asymptotically pivotal. However, conventional asymp- 
totically standard normal or chi-squared distributed test statistics, including Wald, 
Lagrange multiplier, and likelihood ratio tests, and related confidence intervals, are 
asymptotically pivotal. 

11.2.4. The Bootstrap


In this section we provide a broad description of the bootstrap, with further details 
given in subsequent sections. 


Bootstrap Algorithm 


A general bootstrap algorithm is as follows: 


1. Given data $w_1, \ldots, w_N$, draw a bootstrap sample of size $N$ using a method given in the
following and denote this new sample $w_1^*, \ldots, w_N^*$.

2. Calculate an appropriate statistic using the bootstrap sample. Examples include (a) the
estimate $\hat{\theta}^*$ of $\theta$, (b) the standard error $s_{\hat{\theta}^*}$ of the estimate $\hat{\theta}^*$, and (c) a t-statistic
$t^* = (\hat{\theta}^* - \hat{\theta})/s_{\hat{\theta}^*}$ centered at the original estimate $\hat{\theta}$. Here $\hat{\theta}^*$ and $s_{\hat{\theta}^*}$ are calculated in
the usual way but using the new bootstrap sample rather than the original sample.

3. Repeat steps 1 and 2 $B$ independent times, where $B$ is a large number, obtaining $B$
bootstrap replications of the statistic of interest, such as $\hat{\theta}_1^*, \ldots, \hat{\theta}_B^*$ or $t_1^*, \ldots, t_B^*$.

4. Use these $B$ bootstrap replications to obtain a bootstrapped version of the statistic, as
detailed in the following subsections.


Implementation can vary according to how bootstrap samples are obtained, how 
many bootstraps are performed, what statistic is being bootstrapped, and whether or 
not that statistic is asymptotically pivotal. 
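
The four steps can be sketched generically in Python, with the statistic passed in as a function. This is a minimal illustration assuming iid rows and resampling of whole rows with replacement; the names `paired_bootstrap` and `ols_slope`, the simulated data, and B = 499 are hypothetical choices.

```python
import random
import statistics

def paired_bootstrap(data, statistic, B, rng):
    """Steps 1-3: B replications of `statistic` on resamples of the rows of `data`."""
    n = len(data)
    replications = []
    for _ in range(B):
        sample = [data[rng.randrange(n)] for _ in range(n)]  # step 1: resample rows
        replications.append(statistic(sample))               # step 2: recompute
    return replications                                      # step 3: B values

def ols_slope(rows):
    """Least-squares slope of y on x for rows of (y, x) pairs."""
    n = len(rows)
    y_bar = sum(y for y, _ in rows) / n
    x_bar = sum(x for _, x in rows) / n
    sxy = sum((x - x_bar) * (y - y_bar) for y, x in rows)
    sxx = sum((x - x_bar) ** 2 for _, x in rows)
    return sxy / sxx

rng = random.Random(42)
xs = [rng.uniform(0.0, 1.0) for _ in range(100)]
data = [(1.0 + 2.0 * x + rng.gauss(0.0, 0.5), x) for x in xs]  # illustrative dgp

slopes = paired_bootstrap(data, ols_slope, B=499, rng=rng)
se_boot = statistics.stdev(slopes)   # step 4: here, a bootstrap standard error
print(ols_slope(data), se_boot)
```

Keeping the statistic as an argument means the same loop serves for estimates, standard errors, or t-statistics; only step 4 changes with the purpose of the bootstrap.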


Bootstrap Sampling Methods 


The bootstrap dgp in step 1 is used to approximate the true unknown dgp. 

The simplest bootstrapping method is to use the empirical distribution of the data, 
which treats the sample as being the population. Then $w_1^*, \ldots, w_N^*$ are obtained by
sampling with replacement from $w_1, \ldots, w_N$. In each bootstrap sample so obtained,
some of the original data points will appear multiple times whereas others will not
appear at all. This method is an empirical distribution function (EDF) bootstrap
or nonparametric bootstrap. It is also called a paired bootstrap since in
single-equation regression models $w_i = (y_i, x_i)$, so here both $y_i$ and $x_i$ are resampled.

Suppose the conditional distribution of the data is specified, say $y|x \sim F(x, \theta_0)$, and
an estimate $\hat{\theta} \xrightarrow{p} \theta_0$ is available. Then in step 1 we can instead form a bootstrap sample
by using the original $x_i$ while generating $y_i^*$ by random draws from $F(x_i, \hat{\theta})$. This
corresponds to regressors fixed in repeated samples (see Section 4.4.5). Alternatively,
we may first resample $x_i^*$ from $x_1, \ldots, x_N$ and then generate $y_i^*$ from $F(x_i^*, \hat{\theta})$, $i =
1, \ldots, N$. Both are examples of a parametric bootstrap that can be applied in fully
parametric models.

For a regression model with additive iid error, say $y_i = g(x_i, \beta) + u_i$, we can form
fitted residuals $\hat{u}_1, \ldots, \hat{u}_N$, where $\hat{u}_i = y_i - g(x_i, \hat{\beta})$. Then in step 1 bootstrap from
these residuals to get a new draw of residuals, say $(\hat{u}_1^*, \ldots, \hat{u}_N^*)$, leading to a bootstrap
sample $(y_1^*, x_1), \ldots, (y_N^*, x_N)$, where $y_i^* = g(x_i, \hat{\beta}) + \hat{u}_i^*$. This bootstrap is called a
residual bootstrap. It uses information intermediate between the nonparametric and
parametric bootstrap. It can be applied if the error term has a distribution that does not
depend on unknown parameters.
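
A residual bootstrap can be sketched as follows for the special case of a linear regression function, $g(x_i, \beta) = \beta_1 + \beta_2 x_i$. The linear form, the helper names `fit_line` and `residual_bootstrap_slopes`, and the simulated data are assumptions made only for illustration.

```python
import random
import statistics

def fit_line(x, y):
    """Return (intercept, slope) from least squares of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def residual_bootstrap_slopes(x, y, B, rng):
    """Bootstrap the slope by resampling fitted residuals, holding x fixed."""
    b1, b2 = fit_line(x, y)
    fitted = [b1 + b2 * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    slopes = []
    for _ in range(B):
        u_star = rng.choices(residuals, k=len(x))             # new draw of residuals
        y_star = [fi + ui for fi, ui in zip(fitted, u_star)]  # y* = fitted + u*
        slopes.append(fit_line(x, y_star)[1])
    return slopes

rng = random.Random(7)
x = [rng.uniform(0.0, 1.0) for _ in range(100)]
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 0.5) for xi in x]  # illustrative dgp
slope_hat = fit_line(x, y)[1]
slopes = residual_bootstrap_slopes(x, y, B=499, rng=rng)
print(slope_hat, statistics.mean(slopes), statistics.stdev(slopes))
```

In contrast to the paired bootstrap, the regressors are the same in every bootstrap sample; only the dependent variable is regenerated.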

We emphasize the paired bootstrap on grounds of its simplicity, applicability to 
a wide range of nonlinear models, and reliance on weak distributional assumptions. 
However, the other bootstraps generally provide a better approximation (see Horowitz, 
2001, p. 3185) and should be used if the stronger model assumptions they entail are 
warranted. 


The Number of Bootstraps 


The bootstrap asymptotics rely on $N \to \infty$ and so the bootstrap can be asymptotically
valid even for low $B$. However, clearly the bootstrap is more accurate as $B \to \infty$. A
sufficiently large value of $B$ varies with one's tolerance for bootstrap-induced simulation
error and with the purpose of the bootstrap.

Andrews and Buchinsky (2000) present an application-specific numerical method
to determine the number of replications $B$ needed to ensure a given level of accuracy
or, equivalently, the level of accuracy obtained for a given value of $B$. Let $\lambda$ denote
the quantity of interest, such as a standard error or a critical value, $\hat{\lambda}_\infty$ denote the ideal
bootstrap estimate with $B = \infty$, and $\hat{\lambda}_B$ denote the estimate with $B$ bootstraps. Then
Andrews and Buchinsky (2000) show that

    $\sqrt{B}(\hat{\lambda}_B - \hat{\lambda}_\infty)/\hat{\lambda}_\infty \xrightarrow{d} N[0, \omega],$

where $\omega$ varies with the application and is defined in Table III of Andrews and
Buchinsky (2000). It follows that $\Pr[\delta \le z_{\tau/2}\sqrt{\omega/B}] = 1 - \tau$, where $\delta = |\hat{\lambda}_B - \hat{\lambda}_\infty|/\hat{\lambda}_\infty$
denotes the relative discrepancy caused by only $B$ replications. Thus $B \ge \omega z_{\tau/2}^2/\delta^2$
ensures the relative discrepancy is less than $\delta$ with probability at least $1 - \tau$.
Alternatively, given $B$ replications the relative discrepancy is less than $\delta = z_{\tau/2}\sqrt{\omega/B}$.

To provide concrete guidelines we propose the rule of thumb that 


    $B = 384\omega.$


This ensures that the relative discrepancy is less than 10% with probability at least
0.95, since $z_{0.025}^2/0.1^2 = 384$. The only difficult part in implementation is estimation of
$\omega$, which varies with the application.

For standard error estimation $\omega = (2 + \gamma_4)/4$, where $\gamma_4$ is the coefficient of excess
kurtosis for the bootstrap estimator $\hat{\theta}^*$. Intuitively, fatter tails in the distribution of the
estimator mean outliers are more likely, contaminating standard error estimation. It
follows that $B = 384 \times (1/2) = 192$ is enough if $\gamma_4 = 0$ whereas $B = 960$ is needed
if $\gamma_4 = 8$. These values are higher than those proposed by Efron and Tibshirani (1993,
p. 52), who state that $B = 200$ is almost always enough.

For a symmetric two-sided test or confidence interval at level $\alpha$, $\omega = \alpha(1 -
\alpha)/[2 z_{\alpha/2}\phi(z_{\alpha/2})]^2$. This leads to $B = 348$ for $\alpha = 0.05$ and $B = 685$ for $\alpha = 0.01$.
As expected more bootstraps are needed the further one goes into the tails of the
distribution.

For a one-sided test or nonsymmetric two-sided test or confidence interval at level
$\alpha$, $\omega = \alpha(1 - \alpha)/[z_\alpha \phi(z_\alpha)]^2$. This leads to $B = 634$ for $\alpha = 0.05$ and $B = 989$ for
$\alpha = 0.01$. More bootstraps are needed when testing in one tail. For chi-squared tests
with $h$ degrees of freedom $\omega = \alpha(1 - \alpha)/[\chi^2_\alpha(h) f(\chi^2_\alpha(h))]^2$, where $f(\cdot)$ is the $\chi^2(h)$
density.

For test p-values $\omega = (1 - p)/p$. For example, if $p = 0.05$ then $\omega = 19$ and $B =
7,296$. Many more bootstraps are needed for precise calculation of the test p-value
compared to hypothesis rejection if a critical value is exceeded.

For bias-corrected estimation of $\theta$ a simple rule uses $\omega = \hat{\sigma}^2/\hat{\theta}^2$, where the
estimator $\hat{\theta}$ has standard error $\hat{\sigma}$. For example, if the usual t-statistic $t = \hat{\theta}/\hat{\sigma} = 2$ then
$\omega = 1/4$ and $B = 96$. Andrews and Buchinsky (2000) provide many more details and
refinements of these results.
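
The rule of thumb $B = 384\omega$ and the expressions for $\omega$ quoted above can be evaluated numerically. The sketch below uses only the Python standard library; the function names are hypothetical, and $z_p$ is taken to be the standard normal critical value with upper-tail probability $p$, which is an assumption about the notation.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def z_upper(p):
    """Critical value z_p with upper-tail probability p."""
    return STD_NORMAL.inv_cdf(1.0 - p)

def omega_standard_error(excess_kurtosis):
    return (2.0 + excess_kurtosis) / 4.0

def omega_symmetric(alpha):
    z = z_upper(alpha / 2.0)
    return alpha * (1.0 - alpha) / (2.0 * z * STD_NORMAL.pdf(z)) ** 2

def omega_one_sided(alpha):
    z = z_upper(alpha)
    return alpha * (1.0 - alpha) / (z * STD_NORMAL.pdf(z)) ** 2

def omega_p_value(p):
    return (1.0 - p) / p

def rule_of_thumb_B(omega):
    """B = 384 * omega, the rule of thumb for 10% discrepancy at probability 0.95."""
    return 384.0 * omega

cases = [
    ("standard error, excess kurtosis 0", omega_standard_error(0.0)),
    ("standard error, excess kurtosis 8", omega_standard_error(8.0)),
    ("symmetric test, alpha = 0.05", omega_symmetric(0.05)),
    ("symmetric test, alpha = 0.01", omega_symmetric(0.01)),
    ("one-sided test, alpha = 0.05", omega_one_sided(0.05)),
    ("one-sided test, alpha = 0.01", omega_one_sided(0.01)),
    ("p-value, p = 0.05", omega_p_value(0.05)),
]
for label, omega in cases:
    print(label, rule_of_thumb_B(omega))
```

The printed values should agree with those quoted in the text to within rounding.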

For hypothesis testing, Davidson and MacKinnon (2000) provide an alternative 
approach. They focus on the loss of power caused by bootstrapping with finite B. 
(Note that there is no power loss if $B = \infty$.) On the basis of simulations they recommend
at least $B = 399$ for tests at level 0.05, and at least $B = 1,499$ for tests at level
0.01. They argue that for testing their approach is superior to that of Andrews and 
Buchinsky. 

Several other papers by Davidson and MacKinnon, summarized in MacKinnon 
(2002), emphasize practical considerations in bootstrap inference. For hypothesis testing
at level $\alpha$ choose $B$ so that $\alpha(B + 1)$ is an integer. For example, at $\alpha = 0.05$ let
$B = 399$ rather than 400. If instead $B = 400$ it is unclear on an upper one-sided alternative
test whether the 20th or 21st largest bootstrap t-statistic is the critical value.
For nonlinear models computation can be reduced by performing only a few Newton- 
Raphson iterations in each bootstrap sample from starting values equal to the initial 
parameter estimates. 


11.2.5. Standard Error Estimation 


The bootstrap estimate of variance of an estimator is the usual formula for estimating
a variance, applied to the $B$ bootstrap replications $\hat{\theta}_1^*, \ldots, \hat{\theta}_B^*$:

    $s_{\hat{\theta},\mathrm{Boot}}^2 = \frac{1}{B - 1} \sum_{b=1}^{B} (\hat{\theta}_b^* - \bar{\hat{\theta}}^*)^2,$    (11.3)

where

    $\bar{\hat{\theta}}^* = B^{-1} \sum_{b=1}^{B} \hat{\theta}_b^*.$    (11.4)

Taking the square root yields $s_{\hat{\theta},\mathrm{Boot}}$, the bootstrap estimate of the standard error.

This bootstrap provides no asymptotic refinement. Nonetheless, it can be
extraordinarily useful when it is difficult to obtain standard errors using conventional
methods. There are many examples. The estimate $\hat{\theta}$ may be a sequential two-step
m-estimator whose standard error is difficult to compute using the results given in
Section 6.8. The estimate $\hat{\theta}$ may be a 2SLS estimator estimated using a package that
only reports standard errors assuming homoskedastic errors but the errors are
actually heteroskedastic. The estimate $\hat{\theta}$ may be a function of other parameters that are
actually estimated, for example, $\hat{\theta} = \hat{\alpha}/\hat{\beta}$, and the bootstrap can be used instead of
the delta method. For clustered data with many small clusters, such as short panels,
cluster-robust standard errors can be obtained by resampling the clusters.

Since the bootstrap estimate $s_{\hat{\theta},\mathrm{Boot}}$ is consistent, it can be used in place of $s_{\hat{\theta}}$ in
the usual asymptotic formula to form confidence intervals and hypothesis tests that 
are asymptotically valid. Thus asymptotic statistical inference is possible in settings 
where it is difficult to obtain standard errors by other methods. However, there will be 
no improvement in finite-sample performance. To obtain an asymptotic refinement 
the methods of the next section are needed. 
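
The remark about clustered data can be illustrated with a short sketch in which whole clusters, rather than individual observations, are resampled with replacement. The function name `cluster_bootstrap_se`, the simulated data with a common cluster effect, and B = 499 are hypothetical; the statistic here is simply the overall mean.

```python
import random
import statistics

def cluster_bootstrap_se(clusters, statistic, B, rng):
    """Bootstrap standard error obtained by resampling whole clusters.

    clusters: list of lists, one inner list of observations per cluster.
    """
    G = len(clusters)
    replications = []
    for _ in range(B):
        drawn = rng.choices(clusters, k=G)                      # resample G clusters
        pooled = [obs for cluster in drawn for obs in cluster]  # stack their data
        replications.append(statistic(pooled))
    return statistics.stdev(replications)                       # square root of (11.3)

# Illustrative data: 50 clusters of 4 observations sharing a cluster effect.
rng = random.Random(3)
clusters = []
for _ in range(50):
    effect = rng.gauss(0.0, 1.0)
    clusters.append([effect + rng.gauss(0.0, 1.0) for _ in range(4)])

se_cluster = cluster_bootstrap_se(clusters, statistics.mean, B=499, rng=rng)
print(se_cluster)
```

Resampling at the cluster level preserves the within-cluster dependence in every bootstrap sample, which is what makes the resulting standard error cluster robust.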


11.2.6. Hypothesis Testing 


Here we consider tests on an individual coefficient, denoted $\theta$. The test may be either
an upper one-tailed alternative of $H_0: \theta \le \theta_0$ against $H_a: \theta > \theta_0$ or a two-sided test
of $H_0: \theta = \theta_0$ against $H_a: \theta \ne \theta_0$. Other tests are deferred to Section 11.6.3.


Tests with Asymptotic Refinement 


The usual test statistic $T_N = (\hat{\theta} - \theta_0)/s_{\hat{\theta}}$ provides the potential for asymptotic
refinement, as it is asymptotically pivotal since its asymptotic standard normal distribution
does not depend on unknown parameters. We perform $B$ bootstrap replications
producing $B$ test statistics $t_1^*, \ldots, t_B^*$, where

    $t_b^* = (\hat{\theta}_b^* - \hat{\theta})/s_{\hat{\theta}_b^*}.$    (11.5)

The estimates $t_b^*$ are centered around the original estimate $\hat{\theta}$ since resampling is
from a distribution centered around $\hat{\theta}$. The empirical distribution of $t_1^*, \ldots, t_B^*$,
ordered from smallest to largest, is then used to approximate the distribution of $T_N$ as
follows.

For an upper one-tailed alternative test the bootstrap critical value (at level $\alpha$)
is the upper $\alpha$ quantile of the $B$ ordered test statistics. For example, if $B = 999$ and
$\alpha = 0.05$ then the critical value is the 950th smallest value of $t^*$, since then $(B + 1)(1 -
\alpha) = 950$. For a similar lower tail one-sided test the critical value is the 50th smallest
value of $t^*$.

One can also compute a bootstrap p-value in the obvious way. For example, if the
original statistic $t$ lies between the 914th and 915th smallest values of 999 bootstrap
replicates then the p-value for an upper one-tailed alternative test is $1 - 914/(B + 1) =
0.086$.

For a two-sided test a distinction needs to be made between symmetrical and
nonsymmetrical tests. For a nonsymmetrical test or equal-tailed test the bootstrap
critical values (at level $\alpha$) are the lower $\alpha/2$ and upper $\alpha/2$ quantiles of the ordered
test statistics $t^*$, and the null hypothesis is rejected at level $\alpha$ if the original t-statistic
lies outside this range. For a symmetrical test we instead order $|t^*|$ and the bootstrap
critical value (at level $\alpha$) is the upper $\alpha$ quantile of the ordered $|t^*|$. The null
hypothesis is rejected at level $\alpha$ if $|t|$ exceeds this critical value.

These tests, using the percentile-t method, provide asymptotic refinements. For a
one-sided t-test and for a nonsymmetrical two-sided t-test the true size of the test is
$\alpha + O(N^{-1/2})$ with standard asymptotic critical values and $\alpha + O(N^{-1})$ with
bootstrap critical values. For a two-sided symmetrical t-test or for an asymptotic
chi-square test the asymptotic approximations work better, and the true size of the test
is $\alpha + O(N^{-1})$ using standard asymptotic critical values and $\alpha + O(N^{-2})$ using
bootstrap critical values.
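
The percentile-t test can be sketched for the simplest case, in which the estimator is a sample mean and its standard error is $s/\sqrt{N}$. The function names, the simulated exponential data, and the default $B = 999$ are illustrative assumptions; critical values are read off the ordered bootstrap t-statistics as described above.

```python
import math
import random
import statistics

def t_statistic(sample, center):
    """Studentized sample mean: (mean - center) / (s / sqrt(n))."""
    n = len(sample)
    return (statistics.mean(sample) - center) / (statistics.stdev(sample) / math.sqrt(n))

def percentile_t_test(y, theta0, B=999, alpha=0.05, rng=None):
    rng = rng or random.Random()
    n = len(y)
    theta_hat = statistics.mean(y)
    t_orig = t_statistic(y, theta0)              # original statistic uses the null value
    # Each t*_b is centered at the original estimate, as in (11.5).
    t_star = sorted(t_statistic(rng.choices(y, k=n), theta_hat) for _ in range(B))
    k = round((B + 1) * alpha / 2)               # 25 when B = 999 and alpha = 0.05
    lower, upper = t_star[k - 1], t_star[B - k]  # kth smallest and kth largest t*
    abs_star = sorted(abs(t) for t in t_star)
    sym_crit = abs_star[round((B + 1) * (1 - alpha)) - 1]  # upper alpha quantile of |t*|
    return {
        "t": t_orig,
        "equal_tailed": (lower, upper),
        "reject_equal_tailed": t_orig < lower or t_orig > upper,
        "symmetric_critical_value": sym_crit,
        "reject_symmetric": abs(t_orig) > sym_crit,
    }

rng = random.Random(2024)
y = [rng.expovariate(1.0) for _ in range(50)]   # illustrative skewed data, mean 1
result = percentile_t_test(y, theta0=1.0, rng=rng)
far = percentile_t_test(y, theta0=5.0, rng=rng)  # a null value far from the data
print(result)
print(far["reject_equal_tailed"], far["reject_symmetric"])
```

With skewed data the equal-tailed critical values are generally not symmetric about zero, which is the feature that standard normal critical values cannot capture.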


Tests without Asymptotic Refinement 


Alternative bootstrap methods can be used that although asymptotically valid do not 
provide an asymptotic refinement. 

One approach already mentioned at the end of Section 11.2.5 is to compute $t =
(\hat{\theta} - \theta_0)/s_{\hat{\theta},\mathrm{Boot}}$, where the bootstrap estimate $s_{\hat{\theta},\mathrm{Boot}}$ given in (11.3) replaces the usual
estimate $s_{\hat{\theta}}$, and compare this test statistic to critical values from the standard normal
distribution.

A second approach, exposited here for a two-sided test of $H_0: \theta = \theta_0$ against
$H_a: \theta \ne \theta_0$, finds the lower $\alpha/2$ and upper $\alpha/2$ quantiles of the bootstrap estimates
$\hat{\theta}_1^*, \ldots, \hat{\theta}_B^*$ and rejects $H_0$ if $\theta_0$ falls outside this region. This is called the percentile
method. An asymptotic refinement would instead require using $t_b^*$ in (11.5), which centers
around $\hat{\theta}$ rather than $\theta_0$, and using a different standard error $s_{\hat{\theta}^*}$ in each bootstrap.

These two bootstraps have the attraction of not requiring computation of $s_{\hat{\theta}}$, the
usual standard error estimate based on asymptotic theory.


11.2.7. Confidence Intervals 


Much of the statistics literature considers confidence interval estimation rather than its 
flip side of hypothesis tests. Here instead we began with hypothesis tests, so only a 
brief presentation of confidence intervals is necessary. 

An asymptotic refinement is based on the t-statistic, which is asymptotically
pivotal. Thus from steps 1-3 in Section 11.2.4 we obtain bootstrap replication t-statistics
$t_1^*, \ldots, t_B^*$. Then let $t^*_{[1-\alpha/2]}$ and $t^*_{[\alpha/2]}$ denote the lower and upper $\alpha/2$ quantiles of these
t-statistics. The percentile-t method $100(1 - \alpha)$ percent confidence interval is

    $(\hat{\theta} - t^*_{[1-\alpha/2]} \times s_{\hat{\theta}},\ \hat{\theta} + t^*_{[\alpha/2]} \times s_{\hat{\theta}}),$    (11.6)

where $\hat{\theta}$ and $s_{\hat{\theta}}$ are the estimate and standard error from the original sample.

An alternative is the bias-corrected and accelerated ($BC_a$) method detailed in
Efron (1987). This offers an asymptotic refinement in a wider class of problems than 
the percentile-t method. 

Other methods provide an asymptotically valid confidence interval, but without 
asymptotic refinement. First, one can use the bootstrap estimate of the standard 
error in the usual confidence interval formula, leading to the interval $(\hat{\theta} - z_{[1-\alpha/2]} \times
s_{\hat{\theta},\mathrm{Boot}},\ \hat{\theta} + z_{[\alpha/2]} \times s_{\hat{\theta},\mathrm{Boot}})$. Second, the percentile method confidence interval is the
interval between the lower $\alpha/2$ and upper $\alpha/2$ quantiles of the $B$ bootstrap estimates
$\hat{\theta}_1^*, \ldots, \hat{\theta}_B^*$ of $\theta$.
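
The two intervals without asymptotic refinement can be computed from a single set of bootstrap replications, as in the following sketch. It assumes a symmetric normal interval of the form estimate plus or minus the standard normal critical value times the bootstrap standard error; the function name `bootstrap_intervals`, the simulated data, and the default $B = 999$ are hypothetical.

```python
import random
import statistics
from statistics import NormalDist

def bootstrap_intervals(y, statistic, B=999, alpha=0.05, rng=None):
    """Return (normal interval using bootstrap s.e., percentile interval)."""
    rng = rng or random.Random()
    n = len(y)
    theta_hat = statistic(y)
    reps = sorted(statistic(rng.choices(y, k=n)) for _ in range(B))
    se_boot = statistics.stdev(reps)             # bootstrap standard error
    z = NormalDist().inv_cdf(1 - alpha / 2)      # about 1.96 when alpha = 0.05
    normal_ci = (theta_hat - z * se_boot, theta_hat + z * se_boot)
    k = round((B + 1) * alpha / 2)               # 25 when B = 999 and alpha = 0.05
    percentile_ci = (reps[k - 1], reps[B - k])   # kth smallest and kth largest
    return normal_ci, percentile_ci

rng = random.Random(11)
y = [rng.gauss(10.0, 3.0) for _ in range(80)]    # illustrative data
normal_ci, percentile_ci = bootstrap_intervals(y, statistics.mean, rng=rng)
print(normal_ci, percentile_ci)
```

For a smooth statistic such as the mean the two intervals are typically close; neither requires the usual asymptotic standard error.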


11.2.8. Bias Reduction 


Nonlinear estimators are usually biased in finite samples, though this bias goes to zero 
asymptotically if the estimator is consistent. For example, if $\mu^3$ is estimated by $\hat{\theta} = \bar{y}^3$,
where $y_i$ is iid $[\mu, \sigma^2]$, then $E[\hat{\theta} - \mu^3] = 3\mu\sigma^2/N + E[(y - \mu)^3]/N^2$.
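
The bias expression in this example can be checked by a short expansion, sketched here under the stated iid assumption. The only step beyond the text is writing $\bar{y}$ as $\mu + (\bar{y} - \mu)$ and using independence, so that cross-product terms in the third moment of the mean vanish.

```latex
\begin{align*}
\bar{y}^3 &= \mu^3 + 3\mu^2(\bar{y}-\mu) + 3\mu(\bar{y}-\mu)^2 + (\bar{y}-\mu)^3, \\
E[\bar{y}-\mu] &= 0, \qquad
E[(\bar{y}-\mu)^2] = \sigma^2/N, \qquad
E[(\bar{y}-\mu)^3] = E[(y-\mu)^3]/N^2, \\
E[\bar{y}^3 - \mu^3] &= 3\mu\sigma^2/N + E[(y-\mu)^3]/N^2 .
\end{align*}
```

The leading term is of order $1/N$, so the bias vanishes asymptotically even though it is nonzero in any finite sample.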

More generally, for a $\sqrt{N}$-consistent estimator

    $E[\hat{\theta} - \theta_0] = \frac{a_N}{N} + \frac{b_N}{N^2} + \frac{c_N}{N^3} + \cdots,$    (11.7)

where $a_N$, $b_N$, and $c_N$ are bounded constants that vary with the data and estimator (see
Hall, 1992, p. 53). An alternative estimator $\tilde{\theta}$ provides an asymptotic refinement if

    $E[\tilde{\theta} - \theta_0] = \frac{B_N}{N^2} + \frac{C_N}{N^3} + \cdots,$    (11.8)

where $B_N$ and $C_N$ are bounded constants. For both estimators the bias disappears as
$N \to \infty$. The latter estimator has the attraction that the bias goes to zero at a faster
rate, and hence it is an asymptotic refinement, though in finite samples it is possible
that $(B_N/N^2) > (a_N/N + b_N/N^2)$.

We wish to estimate the bias $E[\hat{\theta}] - \theta$. This is the distance between the expected
value or population average value of the estimator and the parameter value generating
the data. The bootstrap replaces the population with the sample, so that the bootstrap
samples are generated by parameter $\hat{\theta}$, and the estimator has average value $\bar{\hat{\theta}}^*$ over the
bootstraps. The bootstrap estimate of the bias is then

    $\mathrm{Bias}_{\hat{\theta}} = (\bar{\hat{\theta}}^* - \hat{\theta}),$    (11.9)

where $\bar{\hat{\theta}}^*$ is defined in (11.4).


Suppose, for example, that $\hat{\theta} = 4$ and $\bar{\hat{\theta}}^* = 5$. Then the estimated bias is $(5 - 4) =
1$, an upward bias of 1. Since $\hat{\theta}$ overestimates by 1, bias correction requires subtracting
1 from $\hat{\theta}$, giving a bias-corrected estimate of 3. More generally, the bootstrap
bias-corrected estimator of $\theta$ is

    $\hat{\theta}_{\mathrm{Boot}} = \hat{\theta} - (\bar{\hat{\theta}}^* - \hat{\theta}) = 2\hat{\theta} - \bar{\hat{\theta}}^*.$    (11.10)

Note that $\bar{\hat{\theta}}^*$ itself is not the bias-corrected estimate. For more details on the direction
of the correction, which may seem puzzling, see Efron and Tibshirani (1993, p. 138).
For typical $\sqrt{N}$-consistent estimators the asymptotic bias of $\hat{\theta}$ is $O(N^{-1})$ whereas the
asymptotic bias of $\hat{\theta}_{\mathrm{Boot}}$ is instead $O(N^{-2})$.

In practice bias correction is seldom used for $\sqrt{N}$-consistent estimators, as the
bootstrap estimate can be more variable than the original estimate $\hat{\theta}$ and the bias is often
small relative to the standard error of the estimate. Bootstrap bias correction is used
for estimators that converge at rate less than $\sqrt{N}$, notably nonparametric regression
and density estimators.
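
Bias correction by (11.10) amounts to one line of arithmetic once the bootstrap replications are available. The sketch below applies it to the earlier example of estimating $\mu^3$ by the cube of the sample mean; the function names, the simulated normal data, and the use of 999 paired resamples are illustrative assumptions.

```python
import random
import statistics

def bias_corrected(theta_hat, replications):
    """Bootstrap bias-corrected estimate 2*theta_hat - mean(replications), as in (11.10)."""
    theta_bar_star = statistics.mean(replications)
    return theta_hat - (theta_bar_star - theta_hat)

def cube_of_mean(sample):
    """Nonlinear estimator of mu^3: the cube of the sample mean."""
    return statistics.mean(sample) ** 3

rng = random.Random(5)
y = [rng.gauss(1.0, 2.0) for _ in range(30)]   # illustrative data with mu = 1
theta_hat = cube_of_mean(y)
reps = [cube_of_mean(rng.choices(y, k=len(y))) for _ in range(999)]
theta_boot = bias_corrected(theta_hat, reps)

print(theta_hat)                           # original estimate
print(statistics.mean(reps) - theta_hat)   # bootstrap estimate of bias, (11.9)
print(theta_boot)                          # bias-corrected estimate, (11.10)
```

Calling `bias_corrected(4.0, [4.0, 6.0])` reproduces the numerical example above: the replications average 5, the estimated bias is 1, and the corrected estimate is 3.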


11.3. Bootstrap Example 


As a bootstrap example, consider the exponential regression model introduced in Sec- 
tion 5.9. Here the data are generated from an exponential distribution with an expo- 
nential mean with two regressors: 


    $y_i | x_i \sim \mathrm{exponential}(\lambda_i), \quad i = 1, \ldots, 50,$
    $\lambda_i = \exp(\beta_1 + \beta_2 x_{2i} + \beta_3 x_{3i}),$
    $(x_{2i}, x_{3i}) \sim N[0.1, 0.1; 0.1^2, 0.1^2, 0.005],$
    $(\beta_1, \beta_2, \beta_3) = (-2, 2, 2).$


Maximum likelihood estimation on a sample of 50 observations yields $\hat{\beta}_1 =
-2.192$; $\hat{\beta}_2 = 0.267$, $s_2 = 1.417$, and $t_2 = 0.188$; and $\hat{\beta}_3 = 4.664$, $s_3 = 1.741$, and
$t_3 = 2.679$. For this ML example the standard errors were based on $-\hat{A}^{-1}$, minus the
inverse of the estimated Hessian matrix.

We concentrate on statistical inference for $\beta_3$ and demonstrate the bootstrap for
standard error computation, test of statistical significance, confidence intervals, and 
bias correction. The differences between bootstrap and usual asymptotic estimates are 
relatively small in this example and can be much larger in other examples. 

The results reported here are based on the paired bootstrap (see Section 11.2.4) with 
$(y_i, x_{2i}, x_{3i})$ jointly resampled with replacement $B = 999$ times. From Table 11.1, the
999 bootstrap replication estimates $\hat{\beta}_{3,b}^*$, $b = 1, \ldots, 999$, had mean 4.716 and standard
deviation of 1.939. Table 11.1 also gives key percentiles for $\hat{\beta}_3^*$ and $t_3^*$ (defined in the
following).

A parametric bootstrap could have been used instead. Then bootstrap samples 
would be obtained by drawing $y_i$ from the exponential distribution with parameter
$\exp(\hat{\beta}_1 + \hat{\beta}_2 x_{2i} + \hat{\beta}_3 x_{3i})$. In the case of tests of $H_0: \beta_3 = 0$ the exponential
parameter could instead be $\exp(\tilde{\beta}_1 + \tilde{\beta}_2 x_{2i})$, where $\tilde{\beta}_1$ and $\tilde{\beta}_2$ are then the restricted ML
estimates from the original sample.


Standard errors: From (11.3) the bootstrap estimate of standard error is computed
using the usual standard deviation formula for the 999 bootstrap replication
estimates of $\beta_3$. This yields estimate 1.939 compared to the usual asymptotic standard
error estimate of 1.741. Note that this bootstrap offers no refinement and would 
only be used as a check or if finding the standard error by other means proved 
difficult. 


Hypothesis testing with asymptotic refinement: We consider test of $H_0: \beta_3 = 0$
against $H_a: \beta_3 \neq 0$ at level 0.05. A test with asymptotic refinement is based on
the t-statistic, which is asymptotically pivotal. From Section 11.2.6 for each bootstrap
we compute $t_3^* = (\hat{\beta}_3^* - 4.664)/s_{\hat{\beta}_3^*}$, which is centered on the
estimate $\hat{\beta}_3 = 4.664$ from the original sample. For a nonsymmetrical test the
bootstrap critical values




Table 11.1. Bootstrap Statistical Inference on a Slope Coefficient: Example^a

          $\hat{\beta}_3^*$    $t_3^*$    $z = t(\infty)$    $t(47)$

Mean           4.716            0.026          1.021           1.000
SD^b           1.939            1.047          1.000           1.021
1%            -0.336           -2.664         -2.326          -2.408
2.5%           0.501           -2.183         -1.960          -2.012
5%             1.545           -1.728         -1.645          -1.678
25%            3.570           -0.621         -0.675          -0.680
50%            4.772            0.062          0.000           0.000
75%            5.971            0.703          0.675           0.680
95%            7.811            1.706          1.645           1.678
97.5%          8.484            2.066          1.960           2.012
99.0%          9.427            2.529          2.326           2.408

^a Summary statistics and percentiles based on 999 paired bootstrap resamples for
(1) estimate $\hat{\beta}_3^*$; (2) the associated statistics
$t_3^* = (\hat{\beta}_3^* - \hat{\beta}_3)/s_{\hat{\beta}_3^*}$; (3) student t-distribution
with 47 degrees of freedom; (4) standard normal distribution. Original dgp is one draw
from the exponential distribution given in the text; the sample size is 50.

^b SD, standard deviation.


equal the lower and upper 2.5 percentiles of the 999 values of $t_3^*$, the 25th lowest
and 25th highest values. From Table 11.1 these are -2.183 and 2.066. Since the
t-statistic computed from the original sample $t_3 = (4.664 - 0)/1.741 = 2.679 >
2.066$, the null hypothesis is rejected. A symmetrical test that instead uses the upper
5 percentile of $|t_3^*|$ yields bootstrap critical value 2.078 that again leads to
rejection of $H_0$ at level 0.05.

The bootstrap critical values in this example exceed those using the asymptotic 
approximation of either standard normal or t(47), an ad hoc finite-sample adjust- 
ment motivated by the exact result for linear regression under normality. So the 
usual asymptotic results in this example lead to overrejection and have actual size 
that exceeds the nominal size. For example, at 5% the z critical region values 
of (—1.960, 1.960) are smaller than the bootstrap critical values (—2.183, 2.066). 
Figure 11.1 plots the bootstrap estimate based on $t_3^*$ of the density of the t-test,
smoothed using kernel methods, and compares it to the standard normal. The two 
densities appear close, though the left tail is notably fatter for the bootstrap estimate. 
Table 11.1 makes clearer the difference in the tails. 


Hypothesis testing without asymptotic refinement: Alternative bootstrap testing 
methods can be used but do not offer an asymptotic refinement. First, using the 
bootstrap standard error estimate of 1.939, rather than the asymptotic standard error 
estimate of 1.741, yields t3 = (4.664 — 0)/1.939 = 2.405. This leads to rejection at 
level 0.05 using either standard normal or t(47) critical values. Second, from Table 
11.1, 95% of the bootstrap estimates $\hat{\beta}_3^*$ lie in the range (0.501, 8.484),
which does not include the hypothesized value of 0, so again we reject
$H_0: \beta_3 = 0$.


[Figure 11.1 plot: "Bootstrap Density of t-Statistic". Curves: Bootstrap Estimate and
Standard Normal. Vertical axis: Density. Horizontal axis: t-statistic from each
bootstrap replication.]

Figure 11.1: Bootstrap density of t-test statistic for slope equal to zero obtained from
999 bootstrap replications with standard normal density plotted for comparison. Data are
generated from an exponential distribution regression model.


Confidence intervals: An asymptotic refinement is obtained using the 95% percentile-t
confidence interval. Applying (11.6) yields $(4.664 - 2.183 \times 1.741,\ 4.664 +
2.066 \times 1.741)$ or (0.864, 8.260). This compares to a conventional 95% asymptotic
confidence interval of $4.664 \pm 1.960 \times 1.741$ or (1.25, 8.08).

Other confidence intervals can be constructed, but these do not have an asymptotic
refinement. Using the bootstrap standard error estimate leads to a 95% confidence
interval $4.664 \pm 1.960 \times 1.939 = (0.864, 8.464)$. The percentile method
uses the lower and upper 2.5 percentiles of the 999 bootstrap coefficient estimates,
leading to a 95% confidence interval of (0.501, 8.484).


Bias correction: The mean of the 999 bootstrap replication estimates of $\beta_3$ is
4.716, compared to the original estimate of 4.664. The estimated bias of (4.716 -
4.664) = 0.052 is quite small, especially compared to the standard error of $s_3 =
1.741$. The estimated bias is upward and (11.10) yields a bias-corrected estimate of
$\beta_3$ equal to 4.664 - 0.052 = 4.612.

The bootstrap relies on asymptotic theory and may actually provide a finite- 
sample approximation worse than that of conventional methods. To determine that 
the bootstrap is really an improvement here we need a full Monte Carlo analysis 
with, say, 1,000 samples of size 50 drawn from the exponential dgp, with each of 
these samples then bootstrapped, say, 999 times. 
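
The steps of this example can be sketched in Python. This is a minimal illustration, not the authors' code: it assumes that "exponential($\lambda_i$)" means rate $\lambda_i$ (so the mean is $1/\lambda_i$), it simulates its own sample with an arbitrary seed, and it uses a hand-written Newton solver (`fit_exp_mle`, a hypothetical helper). The numbers it produces will therefore differ from those in Table 11.1.

```python
import numpy as np

def fit_exp_mle(y, X, max_iter=50, tol=1e-10):
    """ML for y_i ~ exponential with rate exp(x_i'beta) (assumed parametrization).

    Returns (beta_hat, se), with se based on minus the inverse Hessian."""
    def loglik(b):
        xb = X @ b
        return np.sum(xb - np.exp(xb) * y)

    beta = np.zeros(X.shape[1])
    beta[0] = -np.log(y.mean())            # intercept-only MLE; column 0 is the constant
    for _ in range(max_iter):
        w = np.exp(X @ beta) * y
        grad = X.T @ (1.0 - w)
        hess = -(X * w[:, None]).T @ X
        step = np.linalg.solve(hess, grad)  # Newton update is beta - step
        scale, ll0 = 1.0, loglik(beta)
        while loglik(beta - scale * step) < ll0 and scale > 1e-8:
            scale *= 0.5                    # step-halving guards against overshooting
        beta = beta - scale * step
        if np.max(np.abs(scale * step)) < tol:
            break
    w = np.exp(X @ beta) * y
    cov = np.linalg.inv((X * w[:, None]).T @ X)
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(12345)          # arbitrary seed
N, B = 50, 999
x = rng.multivariate_normal([0.1, 0.1], [[0.1**2, 0.005], [0.005, 0.1**2]], size=N)
X = np.column_stack([np.ones(N), x])
y = rng.exponential(scale=1.0 / np.exp(X @ np.array([-2.0, 2.0, 2.0])))

beta_hat, se_hat = fit_exp_mle(y, X)
b3, s3 = beta_hat[2], se_hat[2]

# Paired bootstrap: resample (y_i, x_2i, x_3i) jointly with replacement.
b3_star, t_star = np.empty(B), np.empty(B)
for b in range(B):
    idx = rng.integers(0, N, size=N)
    bb, ss = fit_exp_mle(y[idx], X[idx])
    b3_star[b] = bb[2]
    t_star[b] = (bb[2] - b3) / ss[2]        # centered on the original estimate

se_boot = b3_star.std(ddof=1)                        # bootstrap standard error
t_lo, t_hi = np.quantile(t_star, [0.025, 0.975])     # nonsymmetrical critical values
t3 = b3 / s3
reject = (t3 < t_lo) or (t3 > t_hi)                  # test of H0: beta_3 = 0
b3_bc = b3 - (b3_star.mean() - b3)                   # bias-corrected estimate
```

Printing `se_boot`, `(t_lo, t_hi)`, `reject` and `b3_bc` gives the analogues of the standard error, critical values, test decision and bias-corrected estimate discussed above.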


11.4. Bootstrap Theory 

The exposition here follows the comprehensive survey of Horowitz (2001). Key results
are consistency of the bootstrap and, if the bootstrap is applied to an asymptotically
pivotal statistic, asymptotic refinement.


11.4.1. The Bootstrap


We use $X_1, \ldots, X_N$ as generic notation for the data, where for notational simplicity
bold is not used for $X_i$ even though it is usually a vector, such as $(y_i, \mathbf{x}_i)$.
The data are assumed to be independent draws from distribution with cdf
$F_0(x) = \Pr[X \le x]$. In the simplest applications $F_0$ is in a finite-dimensional
family, with $F_0 = F_0(x, \theta_0)$.

The statistic being considered is denoted $T_N = T_N(X_1, \ldots, X_N)$. The exact
finite-sample distribution of $T_N$ is $G_N = G_N(t, F_0) = \Pr[T_N \le t]$. The problem is
to find a good approximation to $G_N$.

Conventional asymptotic theory uses the asymptotic distribution of $T_N$, denoted
$G_\infty = G_\infty(t, F_0)$. This may theoretically depend on unknown $F_0$, in which case
we use a consistent estimate of $F_0$. For example, use
$\hat{F}_0 = F_0(\cdot, \hat{\theta})$, where $\hat{\theta}$ is consistent for $\theta_0$.

The empirical bootstrap takes a quite different approach to approximating
$G_N(\cdot, F_0)$. Rather than replace $G_N$ by $G_\infty$, the population cdf $F_0$ is
replaced by a consistent estimator $F_N$ of $F_0$, such as the empirical distribution of
the sample.

$G_N(\cdot, F_N)$ cannot be determined analytically but can be approximated by
bootstrapping. One bootstrap resample with replacement yields the statistic
$T_N^* = T_N(X_1^*, \ldots, X_N^*)$. Repeating this step B independent times yields
replications $T^*_{N,1}, \ldots, T^*_{N,B}$. The empirical cdf of
$T^*_{N,1}, \ldots, T^*_{N,B}$ is the bootstrap estimate of the distribution of $T_N$,
yielding

$$\hat{G}_N(t, F_N) = \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}(T^*_{N,b} \le t), \qquad (11.11)$$

where $\mathbf{1}(A)$ equals one if event A occurs and equals zero otherwise. This is just
the proportion of the bootstrap resamples for which the realized $T_N^* \le t$.
The notation is summarized in Table 11.2. 
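
The bootstrap cdf in (11.11) amounts to a proportion over resamples, which the following sketch makes concrete. The function name `bootstrap_cdf`, the choice of the sample mean as the statistic, and the simulated data are illustrative assumptions, not part of the text.

```python
import numpy as np

def bootstrap_cdf(data, statistic, t, B=999, rng=None):
    """Estimate G_N(t, F_N) as the share of bootstrap statistics that are <= t."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(data)
    # Each resample draws n observations with replacement from the sample (F_N = EDF).
    t_star = np.array([statistic(data[rng.integers(0, n, size=n)]) for _ in range(B)])
    return float(np.mean(t_star <= t))

rng = np.random.default_rng(0)                   # arbitrary seed
sample = rng.exponential(scale=1.0, size=40)     # illustrative data
g_hat = bootstrap_cdf(sample, np.mean, t=sample.mean(), rng=rng)
```

Here `g_hat` is the estimated probability that the resampled mean falls at or below the observed mean.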


11.4.2. Consistency of the Bootstrap 


The bootstrap estimate $\hat{G}_N(t, F_N)$ clearly converges to $G_N(t, F_N)$ as the number
of bootstraps $B \to \infty$. Consistency of the bootstrap estimate $\hat{G}_N(t, F_N)$ for
$G_N(t, F_0)$ therefore requires that

$$G_N(t, F_N) \xrightarrow{p} G_N(t, F_0),$$

uniformly in the statistic t and for all $F_0$ in the space of permitted cdfs.


Table 11.2. Bootstrap Theory Notation

Quantity                     Notation

Sample (iid)                 $X_1, \ldots, X_N$, where $X_i$ is usually a vector
Population cdf of X          $F_0 = F_0(x, \theta_0) = \Pr[X \le x]$
Statistic of interest        $T_N = T_N(X_1, \ldots, X_N)$
Finite sample cdf of $T_N$   $G_N = G_N(t, F_0) = \Pr[T_N \le t]$
Limit cdf of $T_N$           $G_\infty = G_\infty(t, F_0)$
Asymptotic cdf of $T_N$      $\hat{G}_\infty = G_\infty(t, \hat{F}_0)$, where $\hat{F}_0 = F_0(x, \hat{\theta})$
Bootstrap cdf of $T_N$       $\hat{G}_N(t, F_N) = B^{-1} \sum_{b=1}^{B} \mathbf{1}(T^*_{N,b} \le t)$

Clearly, $F_N$ must be consistent for $F_0$. Additionally, smoothness in the dgp $F_0(x)$
is needed, so that $F_N(x)$ and $F_0(x)$ are close to each other uniformly in the
observations x for large N. Moreover, smoothness in $G_N(\cdot, F)$, the cdf of the
statistic considered as a functional of F, is required so that $G_N(\cdot, F_N)$ is close
to $G_N(\cdot, F_0)$ when N is large.

Horowitz (2001, pp. 3166-3168) gives two formal theorems, one general and one
for iid data, and provides examples of potential failure of the bootstrap, including
estimation of the median and estimation with boundary constraints on parameters.

Subject to consistency of $F_N$ for $F_0$ and smoothness requirements on $F_0$ and $G_N$,
the bootstrap leads to consistent estimates and asymptotically valid inference. The
bootstrap is consistent in a very wide range of settings.


11.4.3. Edgeworth Expansions 


An additional attraction of the bootstrap is that it allows for asymptotic refinement. 
Singh (1981) provided a proof using Edgeworth expansions, which we now introduce. 

Consider the asymptotic behavior of $Z_N = \sum_i X_i / \sqrt{N}$, where for simplicity
$X_i$ are standardized scalar random variables that are iid [0, 1]. Then application of a
central limit theorem leads to a limit standard normal distribution for $Z_N$. More
precisely, $Z_N$ has cdf

$$G_N(z) = \Pr[Z_N \le z] = \Phi(z) + O(N^{-1/2}), \qquad (11.12)$$

where $\Phi(\cdot)$ is the standard normal cdf. The remainder term is ignored and regular
asymptotic theory approximates $G_N(z)$ by $G_\infty(z) = \Phi(z)$.

The CLT leading to (11.12) is formally derived by a simple approximation of the
characteristic function of $Z_N$, $E[e^{isZ_N}]$, where $i = \sqrt{-1}$. A better
approximation expands this characteristic function in powers of $N^{-1/2}$. The usual
Edgeworth expansion adds two additional terms, leading to

$$G_N(z) = \Pr[Z_N \le z] = \Phi(z) + \frac{g_1(z)}{\sqrt{N}} + \frac{g_2(z)}{N} + O(N^{-3/2}), \qquad (11.13)$$

where $g_1(z) = -(z^2 - 1)\phi(z)\kappa_3/6$, $\phi(\cdot)$ denotes the standard normal
density, $\kappa_3$ is the third cumulant of $Z_N$, and the lengthy expression for
$g_2(\cdot)$ is given in Rothenberg (1984, p. 895) or Amemiya (1985, p. 93). In general
the rth cumulant $\kappa_r$ is the rth coefficient in the series expansion
$\ln(E[e^{isZ_N}]) = \sum_{r=0}^{\infty} \kappa_r (is)^r / r!$ of the log characteristic
function or cumulant generating function.

The remainder term in (11.13) is ignored and an Edgeworth expansion approximates
$G_N(z, F_0)$ by $G_\infty(z, F_0) = \Phi(z) + N^{-1/2} g_1(z) + N^{-1} g_2(z)$. If $Z_N$ is
a test statistic this can be used to compute p-values and critical values. Alternatively,
(11.13) can be inverted to

$$\Pr\left[ Z_N + \frac{h_1(z)}{\sqrt{N}} + \frac{h_2(z)}{N} \le z \right] \simeq \Phi(z), \qquad (11.14)$$

for functions $h_1(z)$ and $h_2(z)$ given in Rothenberg (1984, p. 895). The left-hand side
gives a modified statistic that will be better approximated by the standard normal than
the original statistic $Z_N$.

The problem in application is that the cumulants of $Z_N$ are needed to evaluate the
functions $g_1(z)$ and $g_2(z)$ or $h_1(z)$ and $h_2(z)$. It can be very difficult to obtain
analytical expressions for these cumulants (e.g., Sargan, 1980, and Phillips, 1983). The
bootstrap provides a numerical method to implement the Edgeworth expansion without the
need to calculate cumulants, as shown in the following.
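
A small numerical check can show what the first Edgeworth term buys. The sketch below is an illustration under stated assumptions: it takes $X_i = E_i - 1$ with $E_i$ standard exponential, so the exact cdf of $Z_N$ is available from the Erlang distribution, and it takes $\kappa_3 = 2$ to be the third cumulant of the standardized $X_i$ (an assumption about the scaling convention, since the $N^{-1/2}$ factor is written out separately in (11.13)). Only the $g_1$ term is used.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def exact_cdf(z, n):
    """Exact Pr[Z_N <= z] for X_i = E_i - 1, E_i ~ Exponential(1), via the Erlang cdf."""
    s = n + z * math.sqrt(n)        # Z_N <= z  <=>  sum of E_i <= n + z*sqrt(n)
    if s <= 0:
        return 0.0
    term, total = 1.0, 1.0          # running s^k / k!, starting at k = 0
    for k in range(1, n):
        term *= s / k
        total += term
    return 1.0 - math.exp(-s) * total

def edgeworth_one_term(z, n, kappa3=2.0):
    """Phi(z) + g1(z)/sqrt(N) with g1(z) = -(z^2 - 1) phi(z) kappa3 / 6."""
    g1 = -(z * z - 1.0) * norm_pdf(z) * kappa3 / 6.0
    return norm_cdf(z) + g1 / math.sqrt(n)

n = 10
for z in (0.0, 2.0):
    print(z, exact_cdf(z, n), norm_cdf(z), edgeworth_one_term(z, n))
```

At these two points the one-term expansion is closer to the exact cdf than the normal approximation; at $z = \pm 1$ the $g_1$ term vanishes, so the two approximations coincide there.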


11.4.4. Asymptotic Refinement via Bootstrap 


We now return to the more general setting of Section 11.4.1, with the additional
assumption that $T_N$ has a limit normal distribution and usual $\sqrt{N}$ asymptotics
apply.

Conventional asymptotic methods use the limit cdf $G_\infty(t, F_0)$ as an approximation
to the true cdf $G_N(t, F_0)$. For $\sqrt{N}$-consistent asymptotically normal estimators
this has an error that in the limit behaves as a multiple of $N^{-1/2}$. We write this as

$$G_N(t, F_0) = G_\infty(t, F_0) + O(N^{-1/2}), \qquad (11.15)$$

where in our example $G_\infty(t, F_0) = \Phi(t)$.

A better approximation is possible using an Edgeworth expansion. Then

$$G_N(t, F_0) = G_\infty(t, F_0) + \frac{g_1(t, F_0)}{\sqrt{N}} + \frac{g_2(t, F_0)}{N} + O(N^{-3/2}). \qquad (11.16)$$

Unfortunately, as already noted, the functions $g_1(\cdot)$ and $g_2(\cdot)$ on the
right-hand side can be difficult to construct.

Now consider the bootstrap estimator $G_N(t, F_N)$. An Edgeworth expansion yields

$$G_N(t, F_N) = G_\infty(t, F_N) + \frac{g_1(t, F_N)}{\sqrt{N}} + \frac{g_2(t, F_N)}{N} + O(N^{-3/2}); \qquad (11.17)$$

see Hall (1992) for details. The bootstrap estimator $G_N(t, F_N)$ is used to approximate
the finite-sample cdf $G_N(t, F_0)$. Subtracting (11.16) from (11.17), we get

$$G_N(t, F_N) - G_N(t, F_0) = [G_\infty(t, F_N) - G_\infty(t, F_0)] + \frac{[g_1(t, F_N) - g_1(t, F_0)]}{\sqrt{N}} + O(N^{-1}). \qquad (11.18)$$

Assume that $F_N$ is $\sqrt{N}$ consistent for the true cdf $F_0$, so that
$F_N - F_0 = O(N^{-1/2})$. For continuous function $G_\infty$ the first term on the
right-hand side of (11.18), $[G_\infty(t, F_N) - G_\infty(t, F_0)]$, is therefore
$O(N^{-1/2})$, so $G_N(t, F_N) - G_N(t, F_0) = O(N^{-1/2})$.

The bootstrap approximation $G_N(t, F_N)$ is therefore in general no closer asymptotically
to $G_N(t, F_0)$ than is the usual asymptotic approximation $G_\infty(t, F_0)$; see (11.15).

Now suppose the statistic $T_N$ is asymptotically pivotal, so that its asymptotic
distribution $G_\infty$ does not depend on unknown parameters. Here this is the case if
$T_N$ is standardized so that its limit distribution is the standard normal. Then
$G_\infty(t, F_N) = G_\infty(t, F_0)$, so (11.18) simplifies to

$$G_N(t, F_N) - G_N(t, F_0) = N^{-1/2}[g_1(t, F_N) - g_1(t, F_0)] + O(N^{-1}). \qquad (11.19)$$

However, because $F_N - F_0 = O(N^{-1/2})$ we have that
$[g_1(t, F_N) - g_1(t, F_0)] = O(N^{-1/2})$ for $g_1$ continuous in F. It follows upon
simplification that $G_N(t, F_N) = G_N(t, F_0) + O(N^{-1})$. The bootstrap approximation
$G_N(t, F_N)$ is now a better asymptotic approximation to $G_N(t, F_0)$ as the error is now
$O(N^{-1})$.

In summary, for a bootstrap on an asymptotically pivotal statistic we have

$$G_N(t, F_0) = G_N(t, F_N) + O(N^{-1}), \qquad (11.20)$$

an improvement on the conventional approximation
$G_N(t, F_0) = G_\infty(t, F_0) + O(N^{-1/2})$.

The bootstrap on an asymptotically pivotal statistic therefore leads to an improved
small-sample performance in the following sense. Let $\alpha$ be the nominal size for a
test procedure. Usual asymptotic theory produces t-tests with actual size
$\alpha + O(N^{-1/2})$, whereas the bootstrap produces t-tests with actual size
$\alpha + O(N^{-1})$.
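
The size comparison can be explored with a small Monte Carlo experiment. The sketch below is illustrative only: it uses the t-statistic for the mean of exponential data rather than any model from the text, and the sample size, number of replications, seed, and the name `rejection_rates` are all assumptions. It compares a two-sided test using standard normal critical values with a nonsymmetrical bootstrap test based on the asymptotically pivotal t-statistic.

```python
import numpy as np

def rejection_rates(N=20, R=500, B=199, alpha=0.05, seed=0):
    """Monte Carlo rejection rates under H0 for asymptotic and bootstrap-t tests."""
    rng = np.random.default_rng(seed)
    z_crit = 1.959964                        # standard normal 97.5th percentile
    rej_asy = rej_boot = 0
    for _ in range(R):
        x = rng.exponential(1.0, size=N)     # true mean is 1, so H0: mu = 1 holds
        xbar, s = x.mean(), x.std(ddof=1)
        t = np.sqrt(N) * (xbar - 1.0) / s
        rej_asy += int(abs(t) > z_crit)
        # Bootstrap the t-statistic, recentered on the sample mean.
        xs = x[rng.integers(0, N, size=(B, N))]
        t_star = np.sqrt(N) * (xs.mean(axis=1) - xbar) / xs.std(axis=1, ddof=1)
        lo, hi = np.quantile(t_star, [alpha / 2, 1 - alpha / 2])
        rej_boot += int(t < lo or t > hi)
    return rej_asy / R, rej_boot / R

asy_rate, boot_rate = rejection_rates()
print(asy_rate, boot_rate)                   # compare each with the nominal 0.05
```

With only a few hundred replications the estimated rejection rates carry noticeable simulation error, so the comparison is indicative rather than conclusive.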

For symmetric two-sided hypothesis tests and confidence intervals the bootstrap on
an asymptotically pivotal statistic can be shown to have approximation error
$O(N^{-3/2})$ compared to error $O(N^{-1})$ using usual asymptotic theory.

The preceding results are restricted to asymptotically normal statistics. For chi- 
squared distributed test statistics the asymptotic gains are similar to those for sym- 
metric two-sided hypothesis tests. For proof of bias reduction by bootstrapping, see 
Horowitz (2001, p. 3172). 

The theoretical analysis leads to the following points. The bootstrap should be from 
distribution Fy consistent for Fo. The bootstrap requires smoothness and continuity in 
Fo and Gy, so that a modification of the standard bootstrap is needed if, for example, 
there is a discontinuity because of a boundary constraint on the parameters such as
$\theta \ge 0$. The bootstrap assumes existence of low-order moments, as low-order
cumulants appear in the function $g_1$ in the Edgeworth expansions. Asymptotic refinement
requires use of an asymptotically pivotal statistic. The bootstrap refinement presented 
assumes iid data, so that modification is needed even for heteroskedastic errors. For 
more complete discussion see Horowitz (2001). 


11.4.5. Power of Bootstrapped Tests 


The analysis of the bootstrap has focused on getting tests with correct size in small 
samples. The size correction of the bootstrap will lead to changes in the power of tests, 
as will any size correction. 

Intuitively, if the actual size of a test using first-order asymptotics exceeds the nom- 
inal size, then bootstrapping with asymptotic refinement will not only reduce the size 
toward the nominal size but, because of less frequent rejection, will also reduce the 
power of the test. Conversely, if the actual size is less than the nominal size then
bootstrapping will increase test power. This is observed in the simulation exercise of
Horowitz (1994, p. 409). Interestingly, in his simulation he finds that although boot- 
strapping first-order asymptotically equivalent tests leads to tests with similar actual 
size (essentially equal to the nominal size) there can be considerable difference in test 
power across the bootstrapped tests. 


11.5. Bootstrap Extensions 


The bootstrap methods presented so far emphasize smooth $\sqrt{N}$-consistent
asymptotically normal estimators based on iid data. The following extensions of the
bootstrap permit, for a wider range of applications, a consistent bootstrap (Sections
11.5.1 and 11.5.2) or a consistent bootstrap with asymptotic refinement (Sections
11.5.3-11.5.5). The presentation of these more advanced methods is brief. Some are used
in Section 11.6.


11.5.1. Subsampling Method 


The subsampling method uses a sample of size m that is substantially smaller than
the sample size N. The subsampling may be with replacement (Bickel, Götze, and van
Zwet, 1997) or without replacement (Politis and Romano, 1994).

Replacement subsampling provides subsamples that are random samples of the pop- 
ulation, rather than random samples of an estimate of the distribution such as the sam- 
ple in the case of a paired bootstrap. Replacement subsampling can then be consistent 
when failure of the smoothness conditions discussed in Section 11.4.2 leads to
inconsistency of a full sample bootstrap. The associated asymptotic error for testing or
confidence intervals, however, is of higher order of magnitude than the usual
$O(N^{-1/2})$ obtained when a full sample bootstrap without refinement can be used.

Subsample bootstraps are useful when full sample bootstraps are invalid, or as a 
way to verify that a full sample bootstrap is valid. Results will differ with the choice of 
subsample size. And there is a considerable increase in sample error because a smaller
fraction of the sample is being used. Indeed, we should have $(m/N) \to 0$ as $N \to \infty$.
Politis, Romano, and Wolf (1999) and Horowitz (2001) provide further details. 
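
A subsampling confidence interval can be sketched as follows. This is a minimal illustration that assumes subsampling without replacement and a $\sqrt{N}$ rate of convergence for the statistic; the function name `subsample_ci`, the subsample size m = 40, and the simulated data are hypothetical choices, and the text notes that results will vary with m.

```python
import numpy as np

def subsample_ci(data, statistic, m, n_sub=999, alpha=0.05, rng=None):
    """Equal-tailed CI from subsamples of size m drawn without replacement.

    Assumes the statistic converges at rate sqrt(sample size)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(data)
    theta_hat = statistic(data)
    roots = np.empty(n_sub)
    for j in range(n_sub):
        sub = rng.choice(data, size=m, replace=False)
        roots[j] = np.sqrt(m) * (statistic(sub) - theta_hat)
    lo, hi = np.quantile(roots, [alpha / 2, 1 - alpha / 2])
    # Rescale the subsample quantiles to the full-sample rate sqrt(n).
    return theta_hat - hi / np.sqrt(n), theta_hat - lo / np.sqrt(n)

rng = np.random.default_rng(1)               # arbitrary seed
data = rng.standard_normal(400)              # illustrative data
ci_low, ci_high = subsample_ci(data, np.median, m=40, rng=rng)
```

Rerunning with a different `m` gives a direct sense of how sensitive the interval is to the subsample size.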


11.5.2. Moving Blocks Bootstrap 


The moving blocks bootstrap is used for data that are dependent rather than independent.
This splits the sample into r nonoverlapping blocks of length l, where $rl \simeq N$.
First, one samples with replacement from these blocks, to give r new blocks, which 
will have a different temporal ordering from the original r blocks. Then one estimates 
the parameters using this bootstrap sample. 

The moving blocks method treats the randomly drawn blocks as being independent
of each other, but allows dependence within the blocks. A similar blocking was actually
used by Anderson (1971) to derive a central limit theorem for an m-dependent
process. The moving blocks process requires $r \to \infty$ as $N \to \infty$ to ensure that
we are likely to draw consecutive blocks uncorrelated with each other. It also requires
the block length $l \to \infty$ as $N \to \infty$. See, for example, Götze and Künsch (1996).
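
The block resampling step described above can be sketched as follows. The function name `block_resample` is hypothetical, and dropping leftover observations when the block length does not divide N is a simplifying assumption of this sketch.

```python
import numpy as np

def block_resample(data, block_len, rng=None):
    """Resample r nonoverlapping blocks of length block_len with replacement."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(data)
    r = n // block_len                        # whole blocks only; any remainder is dropped
    blocks = data[: r * block_len].reshape(r, block_len, *data.shape[1:])
    draw = rng.integers(0, r, size=r)         # block indices, drawn with replacement
    return blocks[draw].reshape(r * block_len, *data.shape[1:])

rng = np.random.default_rng(2)                # arbitrary seed
series = np.arange(20.0)                      # stand-in for a dependent series
resampled = block_resample(series, block_len=5, rng=rng)
```

Within each drawn block the original ordering, and hence the within-block dependence, is preserved; only the ordering of blocks changes.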


11.5.3. Nested Bootstrap 


A nested bootstrap, introduced by Hall (1986), Beran (1987), and Loh (1987), is
a bootstrap within a bootstrap. This method is especially useful if the bootstrap is
on a statistic that is not asymptotically pivotal. In particular, if the standard error of
the estimate is difficult to compute one can bootstrap the current bootstrap sample
to obtain a bootstrap standard error estimate $s_{\hat{\theta}^*,\text{Boot}}$ and form
$t^* = (\hat{\theta}^* - \hat{\theta})/s_{\hat{\theta}^*,\text{Boot}}$, and then apply the
percentile-t method to the bootstrap replications $t_1^*, \ldots, t_B^*$. This
permits asymptotic refinements where a single round of bootstrap would not.

More generally, iterated bootstrapping is a way to improve the performance of
the bootstrap by estimating the errors (i.e., bias) that arise from a single pass of the
bootstrap, and correcting for these errors. In general each further iteration of the
bootstrap reduces bias by a factor $N^{-1}$ if the statistic is asymptotically pivotal and
by a factor $N^{-1/2}$ otherwise. For a good exposition see Hall and Martin (1988). If B
bootstraps are performed at each iteration then $B^k$ bootstraps need to be performed if
there are k iterations. For this reason at most two iterations, called a double bootstrap
or calibrated bootstrap, are done.
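
The nested scheme for a statistic without a convenient standard error formula might look like the following sketch. The coefficient of variation as the statistic, the replication counts, and the name `nested_percentile_t` are illustrative assumptions; the inner bootstrap supplies the standard error used to studentize each outer replication.

```python
import numpy as np

def nested_percentile_t(data, statistic, B_outer=199, B_inner=50, alpha=0.05, rng=None):
    """Percentile-t critical values with an inner bootstrap standard error."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(data)
    theta_hat = statistic(data)

    def boot_se(sample):
        # Inner bootstrap: resample the current bootstrap sample.
        reps = np.array([statistic(sample[rng.integers(0, n, size=n)])
                         for _ in range(B_inner)])
        return reps.std(ddof=1)

    t_star = np.empty(B_outer)
    for b in range(B_outer):
        sample = data[rng.integers(0, n, size=n)]     # outer resample
        t_star[b] = (statistic(sample) - theta_hat) / boot_se(sample)
    lo, hi = np.quantile(t_star, [alpha / 2, 1 - alpha / 2])
    return t_star, (lo, hi)

def coef_variation(x):
    return x.std(ddof=1) / x.mean()

rng = np.random.default_rng(3)                        # arbitrary seed
data = rng.exponential(1.0, size=30)                  # illustrative data
t_star, (t_lo, t_hi) = nested_percentile_t(data, coef_variation, rng=rng)
```

The cost is `B_outer * B_inner` evaluations of the statistic, which is the reason more than two levels of nesting are rarely attempted.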

Davison, Hinkley, and Schechtman (1986) proposed balanced bootstrapping. This 
method ensures that each sample observation is reused exactly the same number of 
times over all B bootstraps, leading to better bootstrap estimates. For implementation 
see Gleason (1988), whose algorithms add little to computational time compared to 
the usual unbalanced bootstrap. 


11.5.4. Recentering and Rescaling 


To yield an asymptotic refinement the bootstrap should be based on an estimate $\hat{F}$
of the dgp $F_0$ that imposes all the conditions of the model under consideration. A
leading example arises with the residual bootstrap.

Least-squares residuals do not sum to zero in nonlinear models, or even in linear
models if there is no intercept. The residual bootstrap (see Section 11.2.4) based
on least-squares residuals will then fail to impose the restriction that $E[u_i] = 0$. The
residual bootstrap should instead bootstrap the recentered residual $\hat{u}_i - \bar{u}$,
where $\bar{u} = N^{-1} \sum_i \hat{u}_i$. Similar recentering should be done for paired
bootstraps of GMM estimators in overidentified models (see Section 11.6.4).

Rescaling of residuals can also be useful. For example, in the linear regression
model with iid errors resample from $(N/(N - K))^{1/2} \hat{u}_i$ since these have variance
$s^2$. Other adjustments include using the standardized residual
$\hat{u}_i / \sqrt{(1 - h_{ii}) s^2}$, where $h_{ii}$ is the ith diagonal entry in the
projection matrix $\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$.
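
A residual bootstrap with recentered and rescaled residuals can be sketched as follows for a linear model without an intercept. Applying the recentering first and then the $(N/(N-K))^{1/2}$ rescaling is an assumption of this sketch, as are the function name `residual_bootstrap_ols` and the simulated data.

```python
import numpy as np

def residual_bootstrap_ols(y, X, B=499, rng=None):
    """Residual bootstrap for OLS using recentered, rescaled residuals."""
    rng = np.random.default_rng() if rng is None else rng
    n, k = X.shape
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta_hat
    # Recenter so the resampled errors have mean zero, then rescale by sqrt(N/(N-K)).
    resid_adj = np.sqrt(n / (n - k)) * (resid - resid.mean())
    betas = np.empty((B, k))
    for b in range(B):
        u_star = rng.choice(resid_adj, size=n, replace=True)
        y_star = X @ beta_hat + u_star
        betas[b] = np.linalg.lstsq(X, y_star, rcond=None)[0]
    return beta_hat, resid_adj, betas

rng = np.random.default_rng(6)                        # arbitrary seed
x = rng.uniform(1.0, 2.0, size=60)
X = x[:, None]                                        # no intercept column
y = 2.0 * x + rng.standard_normal(60)
beta_hat, resid_adj, betas = residual_bootstrap_ols(y, X, rng=rng)
```

Because this model has no intercept, the raw OLS residuals need not average to zero, which is the case where recentering matters.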


11.5.5. The Jackknife 


The bootstrap can be used for bias correction (see Section 11.2.8). An alternative
resampling method is the jackknife, a precursor of the bootstrap. The jackknife uses N
deterministically defined subsamples of size N - 1 obtained by dropping in turn each
of the N observations and recomputing the estimator.

To see how the jackknife works, let $\hat{\theta}_N$ denote the estimate of $\theta$ using
all N observations, and let $\hat{\theta}_{N-1}$ denote the estimate of $\theta$ using the
first (N - 1) observations. If (11.7) holds then
$E[\hat{\theta}_N] = \theta + a_N/N + b_N/N^2 + O(N^{-3})$ and
$E[\hat{\theta}_{N-1}] = \theta + a_N/(N - 1) + b_N/(N - 1)^2 + O(N^{-3})$, which implies
$E[N\hat{\theta}_N - (N - 1)\hat{\theta}_{N-1}] = \theta + O(N^{-2})$. Thus
$N\hat{\theta}_N - (N - 1)\hat{\theta}_{N-1}$ has smaller bias than $\hat{\theta}_N$.

The estimator can be more variable, however, as it uses less of the data. As an
extreme example, if $\hat{\theta} = \bar{y}$ then the new estimator is simply $y_N$, the
Nth observation. The variation can be reduced by dropping each observation in turn and
averaging.

More formally then, consider the estimator $\hat{\theta}$ of a parameter vector $\theta$
based on a sample of size N from iid data. For $i = 1, \ldots, N$ sequentially delete the
ith observation and obtain N jackknife replication estimates $\hat{\theta}_{(-i)}$ from the
N jackknife resamples of size (N - 1). The jackknife estimate of the bias of
$\hat{\theta}$ is $(N - 1)(\bar{\theta} - \hat{\theta})$, where
$\bar{\theta} = N^{-1} \sum_i \hat{\theta}_{(-i)}$ is the average of the N jackknife
replications $\hat{\theta}_{(-i)}$. The bias appears large because of multiplication by
(N - 1), but the differences $(\hat{\theta}_{(-i)} - \hat{\theta})$ are much
smaller than in the bootstrap case since a jackknife resample differs from the original
sample in only one observation.

This leads to the bias-corrected jackknife estimate of 0: 


$$\hat{\theta}_{\text{Jack}} = \hat{\theta} - (N - 1)(\bar{\theta} - \hat{\theta}) = N\hat{\theta} - (N - 1)\bar{\theta}. \qquad (11.21)$$


This reduces the bias from $O(N^{-1})$ to $O(N^{-2})$, which is the same order of bias
reduction as for the bootstrap. It is assumed that, as for the bootstrap, the estimator is
a smooth $\sqrt{N}$-consistent estimator. The jackknife estimate can have increased
variance compared with $\hat{\theta}$, and examples where the jackknife fails are given in
Miller (1974).

A simple example is estimation of $\sigma^2$ from an iid sample with
$y_i \sim [\mu, \sigma^2]$. The estimate $\hat{\sigma}^2 = N^{-1} \sum_i (y_i - \bar{y})^2$,
the MLE under normality, has $E[\hat{\sigma}^2] = \sigma^2 (N - 1)/N$ so that the bias
equals $-\sigma^2/N$, which is $O(N^{-1})$. In this example the jackknife estimate
can be shown to simplify to $\hat{\sigma}^2_{\text{Jack}} = (N - 1)^{-1} \sum_i (y_i - \bar{y})^2$,
so one does not need to compute N separate estimates $\hat{\sigma}^2_{(-i)}$. This is an
unbiased estimate of $\sigma^2$, so the bias is actually zero rather than the general
result of $O(N^{-2})$.

The jackknife is due to Quenouille (1956). Tukey (1958) considered application to 
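a wider range of statistics. In particular, the jackknife estimate of the standard error
of an estimator $\hat{\theta}$ is

$$\widehat{\text{se}}_{\text{Jack}}[\hat{\theta}] = \left[ \frac{N - 1}{N} \sum_{i=1}^{N} (\hat{\theta}_{(-i)} - \bar{\theta})^2 \right]^{1/2}. \qquad (11.22)$$
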
Tukey proposed the term jackknife by analogy to a Boy Scout jackknife that solves 
a variety of problems, each of which could be solved more efficiently by a specially 
constructed tool. The jackknife is a “rough and ready” method for bias reduction in 
many situations, but it is not the ideal method in any. The jackknife can be viewed as a 
linear approximation of the bootstrap (Efron and Tibshirani, 1993, p. 146). It requires
less computation than the bootstrap in small samples, as then $N < B$ is likely, but is
outperformed by the bootstrap as $B \to \infty$.
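
The jackknife formulas (11.21) and (11.22) translate directly into a short routine. The sketch below uses a hypothetical function name `jackknife` and simulated data; it recomputes the statistic N times by brute force and then checks the variance example, where the bias-corrected estimate should coincide with the usual unbiased variance estimator.

```python
import numpy as np

def jackknife(data, statistic):
    """Leave-one-out jackknife: bias-corrected estimate, bias estimate, standard error."""
    n = len(data)
    theta_hat = statistic(data)
    reps = np.array([statistic(np.delete(data, i, axis=0)) for i in range(n)])
    theta_bar = reps.mean(axis=0)
    bias = (n - 1) * (theta_bar - theta_hat)
    theta_jack = n * theta_hat - (n - 1) * theta_bar                     # (11.21)
    se = np.sqrt((n - 1) / n * np.sum((reps - theta_bar) ** 2, axis=0))  # (11.22)
    return theta_jack, bias, se

rng = np.random.default_rng(7)                  # arbitrary seed
y = rng.normal(loc=1.0, scale=2.0, size=25)     # illustrative data

# np.var with the default ddof=0 is the biased estimator with divisor N.
var_jack, var_bias, var_se = jackknife(y, np.var)
mean_jack, mean_bias, mean_se = jackknife(y, np.mean)
```

For the sample mean the jackknife standard error reduces to the familiar $s/\sqrt{N}$, which gives a second check on the implementation.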

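Consider the linear regression model $\mathbf{y} = \mathbf{X}\beta + \mathbf{u}$, with
$\hat{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$. An example of a biased
estimator from OLS regression is a time-series model with lagged dependent variable as
regressor. The regression estimator based on the ith jackknife sample
$(\mathbf{X}_{(-i)}, \mathbf{y}_{(-i)})$ is given by

$$\hat{\beta}_{(-i)} = [\mathbf{X}_{(-i)}'\mathbf{X}_{(-i)}]^{-1}\mathbf{X}_{(-i)}'\mathbf{y}_{(-i)}$$
$$= [\mathbf{X}'\mathbf{X} - \mathbf{x}_i\mathbf{x}_i']^{-1}(\mathbf{X}'\mathbf{y} - \mathbf{x}_i y_i)$$
$$= \hat{\beta} - [\mathbf{X}'\mathbf{X}]^{-1}\mathbf{x}_i(y_i - \mathbf{x}_i'\hat{\beta}_{(-i)}).$$

The third equality avoids the need to invert $\mathbf{X}_{(-i)}'\mathbf{X}_{(-i)}$ for each
i and is obtained using

$$[\mathbf{X}'\mathbf{X}]^{-1} = [\mathbf{X}_{(-i)}'\mathbf{X}_{(-i)}]^{-1} - \frac{[\mathbf{X}_{(-i)}'\mathbf{X}_{(-i)}]^{-1}\mathbf{x}_i\mathbf{x}_i'[\mathbf{X}_{(-i)}'\mathbf{X}_{(-i)}]^{-1}}{1 + \mathbf{x}_i'[\mathbf{X}_{(-i)}'\mathbf{X}_{(-i)}]^{-1}\mathbf{x}_i}.$$

Here the pseudo-values are given by $N\hat{\beta} - (N - 1)\hat{\beta}_{(-i)}$ and the
jackknife estimator of $\beta$ is given by

$$\hat{\beta}_{\text{Jack}} = N\hat{\beta} - (N - 1)\frac{1}{N}\sum_{i=1}^{N}\hat{\beta}_{(-i)}. \qquad (11.23)$$
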

An interesting application of the jackknife to bias reduction is the jackknife IV 
estimator (see Section 6.4.4). 
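
The leave-one-out OLS estimates can be computed without refitting N regressions. The sketch below uses the standard closed form $\hat{\beta}_{(-i)} = \hat{\beta} - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i \hat{u}_i/(1 - h_{ii})$, which follows from the rank-one update of the inverse; this particular form, the function name `jackknife_ols`, and the simulated data are assumptions of the sketch rather than expressions taken from the text.

```python
import numpy as np

def jackknife_ols(y, X):
    """All N leave-one-out OLS estimates from a single matrix inverse."""
    n = X.shape[0]
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)      # leverages h_ii = x_i'(X'X)^{-1}x_i
    # Row i of (X @ XtX_inv) is ((X'X)^{-1} x_i)', since (X'X)^{-1} is symmetric.
    loo = beta_hat - (X @ XtX_inv) * (resid / (1.0 - h))[:, None]
    beta_jack = n * beta_hat - (n - 1) * loo.mean(axis=0)   # as in (11.23)
    return beta_hat, loo, beta_jack

rng = np.random.default_rng(4)                       # arbitrary seed
n = 30
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.standard_normal(n)
beta_hat, loo, beta_jack = jackknife_ols(y, X)
```

Each row of `loo` can be compared with a direct least-squares fit on the sample that omits that observation to confirm the shortcut.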


11.6. Bootstrap Applications 


We consider application of the bootstrap taking into account typical microeconometric 
complications such as heteroskedasticity and clustering and more complicated estima- 
tors that can lead to failure of simple bootstraps. 


11.6.1. Heteroskedastic Errors 


For least squares in models with additive errors that are heteroskedastic, the standard 
procedure is to use White’s heteroskedastic-consistent covariance matrix estimator 
(HCCME). This is well known to perform poorly in small samples. When done cor- 
rectly, the bootstrap can provide an improvement. 

The paired bootstrap leads to valid inference, since the essential assumption that
$(y_i, \mathbf{x}_i)$ is iid still permits $V[u_i|\mathbf{x}_i]$ to vary with
$\mathbf{x}_i$ (see Section 4.4.7). However, it does not offer an asymptotic refinement
because it does not impose the condition that $E[u_i|\mathbf{x}_i] = 0$.

The usual residual bootstrap actually leads to invalid inference, since it assumes
that $u_i|\mathbf{x}_i$ is iid and hence erroneously imposes the condition of
homoskedastic errors. In terms of Section 11.4 theory, $\hat{F}$ is then inconsistent for
$F_0$. One can specify a formal model for heteroskedasticity, say
$u_i = \exp(\mathbf{z}_i'\alpha)\varepsilon_i$, where $\varepsilon_i$ are iid, obtain
estimate $\exp(\mathbf{z}_i'\hat{\alpha})$, and then bootstrap the implied residuals
$\hat{\varepsilon}_i$. Consistency and asymptotic refinement of this bootstrap requires
correct specification of the functional form for the heteroskedasticity.

The wild bootstrap, introduced by Wu (1986) and Liu (1988) and studied further 
by Mammen (1993), provides asymptotic refinement without imposing such structure 
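on the heteroskedasticity. This bootstrap replaces the OLS residual $\hat{u}_i$ by the
following residual:

$$\hat{u}_i^* = \begin{cases} \frac{1 - \sqrt{5}}{2}\,\hat{u}_i \simeq -0.6180\,\hat{u}_i & \text{with probability } \frac{1 + \sqrt{5}}{2\sqrt{5}} \simeq 0.7236, \\ \left(1 - \frac{1 - \sqrt{5}}{2}\right)\hat{u}_i \simeq 1.6180\,\hat{u}_i & \text{with probability } 1 - \frac{1 + \sqrt{5}}{2\sqrt{5}} \simeq 0.2764. \end{cases}$$

Taking expectations with respect to only this two-point distribution and performing
some algebra yields $E[\hat{u}_i^*] = 0$, $E[\hat{u}_i^{*2}] = \hat{u}_i^2$, and
$E[\hat{u}_i^{*3}] = \hat{u}_i^3$. Thus $\hat{u}_i^*$ leads to a residual with zero
conditional mean as desired, since $E[\hat{u}_i^*|\hat{u}_i, \mathbf{x}_i] = 0$ implies
$E[\hat{u}_i^*|\mathbf{x}_i] = 0$, while the second and third moments are unchanged.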

The wild bootstrap resamples have ith observation $(y_i^*, \mathbf{x}_i)$, where $y_i^* = \mathbf{x}_i'\hat{\beta} + \hat{u}_i^*$.
The resamples vary because of different realizations of u*. Simulations by Horowitz 
(1997, 2001) show that this bootstrap works much better than a paired bootstrap when 
there is heteroskedasticity and works well compared to other bootstrap methods even 
if there is no heteroskedasticity. 

It seems surprising that this bootstrap should work because for the ith observation
it draws from only two possible values for the residual, $-0.6180\hat{u}_i$ or
$1.6180\hat{u}_i$. However, a similar draw is being made over all N observations and over
B bootstrap iterations. Recall also that White's estimator replaces $E[u_i^2]$ by
$\hat{u}_i^2$, which, although incorrect for one observation, is valid when averaged over
the sample. The wild bootstrap is instead drawing from a two-point distribution with mean
0 and variance $\hat{u}_i^2$.
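
The two-point wild bootstrap can be sketched as follows. The function name `wild_bootstrap_ols`, the simulated heteroskedastic data, and the decision to collect coefficient replications (rather than studentized statistics, which a refinement would require) are illustrative assumptions.

```python
import numpy as np

SQRT5 = np.sqrt(5.0)
V_LOW, V_HIGH = (1.0 - SQRT5) / 2.0, (1.0 + SQRT5) / 2.0   # about -0.6180 and 1.6180
P_LOW = (1.0 + SQRT5) / (2.0 * SQRT5)                      # about 0.7236

def wild_bootstrap_ols(y, X, B=499, rng=None):
    """Wild bootstrap for OLS: regressors fixed, residuals multiplied by two-point draws."""
    rng = np.random.default_rng() if rng is None else rng
    n, k = X.shape
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    fitted = X @ beta_hat
    resid = y - fitted
    betas = np.empty((B, k))
    for b in range(B):
        v = np.where(rng.random(n) < P_LOW, V_LOW, V_HIGH)  # one draw per observation
        y_star = fitted + v * resid
        betas[b] = np.linalg.lstsq(X, y_star, rcond=None)[0]
    return beta_hat, betas

rng = np.random.default_rng(8)                              # arbitrary seed
x = rng.uniform(0.0, 2.0, size=80)
X = np.column_stack([np.ones(80), x])
y = 1.0 + 0.5 * x + np.exp(0.5 * x) * rng.standard_normal(80)   # heteroskedastic errors
beta_hat, betas = wild_bootstrap_ols(y, X, rng=rng)
se_wild = betas.std(axis=0, ddof=1)
```

The constants can be checked against the moment conditions stated above: the multiplier has mean 0, second moment 1, and third moment 1.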


11.6.2. Panel Data and Clustered Data 


Consider a linear panel regression model

$$\tilde{y}_{it} = \tilde{\mathbf{w}}_{it}'\theta + \tilde{u}_{it},$$

where i denotes individual and t denotes time period. Following the notation of Section
21.2.3, the tilde is added as the original data $y_{it}$ and $\mathbf{x}_{it}$ may first
be transformed to eliminate fixed effects, for example. We assume that the errors
$\tilde{u}_{it}$ are independent over i, though they may be heteroskedastic and correlated
over t for given i.

If the panel is short, so that T is finite and asymptotic theory relies on
$N \to \infty$, then consistent standard errors for $\hat{\theta}$ can be obtained by a
paired or EDF bootstrap that resamples over i but does not resample over t. In the
preceding presentation $\mathbf{w}_i$ becomes
$[y_{i1}, \mathbf{x}_{i1}, \ldots, y_{iT}, \mathbf{x}_{iT}]$ and we resample over i and
obtain all T observations for the chosen i.

This panel bootstrap, also called a block bootstrap, can also be applied to the 
nonlinear panel models of Chapter 23. The key assumptions are that the panel is short 
and the data are independent over i. More generally, this bootstrap can be applied 
whenever data are clustered (see Section 24.5), provided cluster size is finite and the
number of clusters goes to infinity.

The panel bootstrap produces standard errors that are asymptotically equivalent to
panel robust sandwich standard errors (see Section 21.2.3). It does not provide an 
asymptotic refinement. However, it is quite simple to implement and is practically very 
useful as many packages do not automatically provide panel robust standard errors 
even for quite basic panel estimators such as the fixed effects estimator. Depending 
on the application, other bootstraps such as parametric and residual bootstraps may be 
possible, provided again that resampling is over i only. 
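
A panel (cluster) bootstrap for standard errors resamples whole individuals, keeping all of each individual's time periods together. The sketch below is illustrative: the pooled OLS estimator, the function name `cluster_bootstrap_se`, and the simulated panel with an individual-specific error component are assumptions.

```python
import numpy as np

def cluster_bootstrap_se(y, X, ids, B=499, rng=None):
    """Bootstrap standard errors for OLS, resampling clusters (individuals) over i only."""
    rng = np.random.default_rng() if rng is None else rng
    groups = [np.flatnonzero(ids == g) for g in np.unique(ids)]
    n_groups = len(groups)
    betas = np.empty((B, X.shape[1]))
    for b in range(B):
        # Draw individuals with replacement; take every observation of each drawn individual.
        rows = np.concatenate([groups[j] for j in rng.integers(0, n_groups, size=n_groups)])
        betas[b] = np.linalg.lstsq(X[rows], y[rows], rcond=None)[0]
    return betas.std(axis=0, ddof=1)

rng = np.random.default_rng(5)                       # arbitrary seed
n_ind, T = 40, 3
ids = np.repeat(np.arange(n_ind), T)
x = rng.standard_normal(n_ind * T)
u = np.repeat(rng.standard_normal(n_ind), T) + rng.standard_normal(n_ind * T)
y = 1.0 + 0.5 * x + u                                # errors correlated within individual
X = np.column_stack([np.ones(n_ind * T), x])
se_cluster = cluster_bootstrap_se(y, X, ids, rng=rng)
```

The same resampling over i applies to other panel estimators by replacing the least-squares call with the estimator of interest.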

Asymptotic refinement is straightforward if the errors are iid. More realistically,
however, $\tilde{u}_{it}$ will be heteroskedastic and correlated over t for given i. The
wild bootstrap (see Section 11.6.1) should provide an asymptotic refinement in a linear
model if the panel is short. Then wild bootstrap resamples have (i, t)th observation
$(\tilde{y}_{it}^*, \tilde{\mathbf{w}}_{it})$, where
$\tilde{y}_{it}^* = \tilde{\mathbf{w}}_{it}'\hat{\theta} + \hat{u}_{it}^*$,
$\hat{u}_{it} = \tilde{y}_{it} - \tilde{\mathbf{w}}_{it}'\hat{\theta}$, and
$\hat{u}_{it}^*$ is a draw from the two-point distribution given in Section 11.6.1.


11.6.3. Hypothesis and Specification Tests 


Section 11.2.6 focused on tests of the hypothesis $\theta = \theta_0$. Here we consider more gen-
eral tests. As in Section 11.2.6, the bootstrap can be used to perform hypothesis tests 
with or without asymptotic refinement. 


Tests without Asymptotic Refinement 


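A leading example of the usefulness of the bootstrap is the Hausman test (see
Section 8.3). Standard implementation of this test requires estimation of
$\mathrm{V}[\hat{\theta} - \tilde{\theta}]$, where $\hat{\theta}$ and $\tilde{\theta}$ are the two
estimators being contrasted. Obtaining this estimate can be difficult unless the strong
assumption is made that one of the estimators is fully efficient under $H_0$. The paired
bootstrap can be used instead, leading to the consistent estimate

$$
\hat{\mathrm{V}}_{\text{Boot}}[\hat{\theta} - \tilde{\theta}] = \frac{1}{B-1} \sum_{b=1}^{B}
\left[(\hat{\theta}_b^* - \tilde{\theta}_b^*) - (\bar{\hat{\theta}}^* - \bar{\tilde{\theta}}^*)\right]
\left[(\hat{\theta}_b^* - \tilde{\theta}_b^*) - (\bar{\hat{\theta}}^* - \bar{\tilde{\theta}}^*)\right]',
$$

where $\bar{\hat{\theta}}^* = B^{-1} \sum_b \hat{\theta}_b^*$ and
$\bar{\tilde{\theta}}^* = B^{-1} \sum_b \tilde{\theta}_b^*$. Then compute

$$
H = (\hat{\theta} - \tilde{\theta})' \left(\hat{\mathrm{V}}_{\text{Boot}}[\hat{\theta} - \tilde{\theta}]\right)^{-1} (\hat{\theta} - \tilde{\theta}) \tag{11.24}
$$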


and compare to chi-square critical values. As mentioned in Chapter 8, a generalized 
inverse may need to be used and care may be needed to ensure chi-square critical 
values are obtained using the correct degrees of freedom. 
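
As a rough sketch of (11.24), the following contrasts OLS with a just-identified IV estimator on simulated data. The estimator pair, the data-generating process, and the names `ols`, `iv`, and `bootstrap_hausman` are assumptions made for illustration; the `keep` argument restricts the contrast to the coefficients being tested (here the slope only), so the statistic is compared with chi-square(1) critical values.

```python
import numpy as np

def ols(y, X):
    return np.linalg.solve(X.T @ X, X.T @ y)

def iv(y, X, Z):
    # Just-identified instrumental variables estimator.
    return np.linalg.solve(Z.T @ X, Z.T @ y)

def bootstrap_hausman(y, X, Z, keep, B=400, seed=0):
    """Hausman statistic using a paired-bootstrap variance of the contrast."""
    rng = np.random.default_rng(seed)
    n = len(y)
    d_hat = (iv(y, X, Z) - ols(y, X))[keep]          # contrast in the original sample
    d_star = np.empty((B, len(keep)))
    for b in range(B):
        idx = rng.integers(0, n, size=n)             # resample (y, x, z) jointly
        d_star[b] = (iv(y[idx], X[idx], Z[idx]) - ols(y[idx], X[idx]))[keep]
    # Sample covariance with divisor B - 1, as in the bootstrap variance formula.
    V = np.atleast_2d(np.cov(d_star, rowvar=False, ddof=1))
    H = float(d_hat @ np.linalg.pinv(V) @ d_hat)     # generalized inverse
    return H, V

# Simulated data with an endogenous regressor, so the two estimators should differ.
rng = np.random.default_rng(1)
n = 500
z = rng.normal(size=n)
v = rng.normal(size=n)
x = z + v
u = 0.8 * v + rng.normal(size=n)
y = 1.0 + 2.0 * x + u
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])

H, V = bootstrap_hausman(y, X, Z, keep=[1])
```

With one coefficient in the contrast, `H` would be compared with the chi-square(1) critical value (3.84 at the 5% level).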

More generally, this approach can be used for any standard normal test or chi-square 
distributed test where implementation is difficult because a variance matrix must be 
estimated. Examples include hypothesis tests based on a two-step estimator and the 
m-tests of Chapter 8. 


Tests with Asymptotic Refinement 


Many tests, especially those for fully parametric models such as the LM test and IM 
test, can be simply implemented using an auxiliary regression (see Sections 7.3.5 and
8.2.2). The resulting test statistics, however, perform poorly in finite samples as
documented in many Monte Carlo studies. Such test statistics are easily computed and are
asymptotically pivotal as the chi-square distribution does not depend on unknown pa- 
rameters. They are therefore prime candidates for asymptotic refinement by bootstrap. 

Consider the m-test of $H_0: \mathrm{E}[m_i(y_i|x_i, \theta)] = 0$ against
$H_a: \mathrm{E}[m_i(y_i|x_i, \theta)] \neq 0$ (see Section 8.2). From the original data estimate
$\theta$ by ML, and calculate the test statistic $M$. Using a parametric bootstrap, resample
$y_i^*$ from the fitted conditional density $f(y_i|x_i, \hat{\theta})$, for fixed regressors in
repeated samples, or from $f(y_i|x_i^*, \hat{\theta})$ with resampled regressors $x_i^*$. Compute
$M_b^*$, $b = 1, \ldots, B$, in the bootstrap resamples. Reject $H_0$ at level $\alpha$ if the
original calculated statistic $M$ exceeds the upper $\alpha$ quantile, that is, the
$(1 - \alpha)$th quantile, of $M_b^*$, $b = 1, \ldots, B$.
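
The steps above can be sketched for a deliberately simple case. The example below is an assumption chosen for brevity, not the test discussed in the text: the model is $y_i \sim \mathcal{N}[\mu, \sigma^2]$ with no regressors, and the moment tested is zero skewness, with statistic $M = N \cdot \text{skewness}^2 / 6$. The function names are hypothetical.

```python
import numpy as np

def skewness_stat(y):
    """M = N * (sample skewness)^2 / 6; asymptotically chi-square(1) under normality."""
    n = len(y)
    mu, sigma = y.mean(), y.std()          # ML estimates under normality (ddof = 0)
    skew = np.mean(((y - mu) / sigma) ** 3)
    return n * skew**2 / 6.0

def parametric_bootstrap_test(y, stat, B=999, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    mu_hat, sigma_hat = y.mean(), y.std()
    M = stat(y)
    # Draw each bootstrap sample from the fitted density and recompute the statistic.
    M_star = np.array([stat(rng.normal(mu_hat, sigma_hat, size=n)) for _ in range(B)])
    crit = np.quantile(M_star, 1 - alpha)   # upper alpha quantile of the bootstrap statistics
    p_value = (1 + np.sum(M_star >= M)) / (B + 1)
    return M, crit, p_value

# Data that violate the null: exponential draws are strongly right-skewed.
rng = np.random.default_rng(7)
y = rng.exponential(scale=1.0, size=100)
M, crit, p_value = parametric_bootstrap_test(y, skewness_stat)
```

The bootstrap critical value `crit` replaces the asymptotic chi-square(1) value; the test rejects when `M > crit`.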

Horowitz (1994) presented this bootstrap for the IM test and demonstrated with 
simulation examples that there are substantial finite-sample gains to this bootstrap. A 
detailed application by Drukker (2002) to specification tests for the tobit model sug- 
gests that conditional moment specification tests can be easily applied to fully para- 
metric models, since any size distortion in the auxiliary regressions can be corrected 
through bootstrap. 

Note that bootstrap tests without asymptotic refinement, such as the Hausman test 
given here, can be refined by use of the nested bootstrap given in Section 11.5.3. 


11.6.4. GMM, Minimum Distance, and Empirical Likelihood in 
Overidentified Models 


The GMM estimator is based on population moment conditions $\mathrm{E}[h(w_i, \theta)] = 0$
(see Section 6.3.1). In a just-identified model a consistent estimator simply solves
$N^{-1} \sum_i h(w_i, \hat{\theta}) = 0$. In overidentified models this estimator is no longer
feasible. Instead, the GMM estimator is used (see Section 6.3.2).

Now consider bootstrapping, using the paired or EDF bootstrap. For GMM in an
overidentified model $N^{-1} \sum_i h(w_i, \hat{\theta}) \neq 0$, so this bootstrap does not
impose on the bootstrap resamples the original population restriction that
$\mathrm{E}[h(w_i, \theta)] = 0$. As a result even if the asymptotically pivotal t-statistic is
used there is no longer a bootstrap refinement, though bootstraps on $\hat{\theta}$ and related
confidence intervals and t-test statistics remain consistent. More fundamentally, the
bootstrap of the OIR test (see Section 6.3.8) can be shown to be inconsistent. We focus on
cross-section data but similar issues arise for panel GMM estimators (see Chapter 22) in
overidentified models.

Hall and Horowitz (1996) propose correcting this by recentering. Then the bootstrap
is based on $h^*(w_i, \theta) = h(w_i, \theta) - N^{-1} \sum_j h(w_j, \hat{\theta})$ and asymptotic
refinements can be obtained for statistics based on $\hat{\theta}$ including the OIR test.

Horowitz (1998) does similar recentering for the minimum distance estimator (see 
Section 6.7). He then applies the bootstrap to the covariance structure example of 
Altonji and Segal (1996) discussed in Section 6.3.5. 

An alternative adjustment proposed by Brown and Newey (2002) is to not recenter
but to instead resample the observations $w_i$ with probabilities that vary across
observations rather than using equal weights $1/N$. Specifically, let
$\Pr[w^* = w_i] = \pi_i$, where $\pi_i = 1/[N(1 + \hat{\lambda}'\hat{h}_i)]$,
$\hat{h}_i = h(w_i, \hat{\theta})$, and $\hat{\lambda}$ maximizes $\sum_i \ln(1 + \lambda'\hat{h}_i)$.
The motivation is that the probabilities $\pi_i$ equivalently are the solution to an
empirical likelihood (EL) problem (see Section 6.8) of maximizing $\sum_i \ln \pi_i$ with
respect to $\pi_1, \ldots, \pi_N$ subject to the constraints $\sum_i \pi_i \hat{h}_i = 0$ and
$\sum_i \pi_i = 1$. This empirical likelihood bootstrap of the GMM estimator therefore
imposes the constraint $\sum_i \pi_i \hat{h}_i = 0$.


One could instead work directly with EL from the beginning, letting $\hat{\theta}$ be the EL
estimator rather than the GMM estimator. The advantage of the Brown and Newey
(2002) approach is that it avoids the more challenging computation of the EL estimator.
Instead, one needs only the GMM estimator and solution of the concave programming
problem of maximizing $\sum_i \ln(1 + \lambda'\hat{h}_i)$ with respect to $\lambda$.
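
A minimal sketch of the concave programming step follows, assuming the moment contributions $\hat{h}_i$ are already stacked in an $N \times r$ array `h` (simulated here rather than taken from a GMM fit). Newton's method with step halving to keep every $1 + \lambda'\hat{h}_i$ positive is one simple choice of solver, not necessarily the one used by Brown and Newey; the function name is hypothetical.

```python
import numpy as np

def el_probabilities(h, tol=1e-10, max_iter=50):
    """Solve max_lam sum_i ln(1 + lam'h_i); return pi_i = 1 / (N (1 + lam'h_i)) and lam."""
    n, r = h.shape
    lam = np.zeros(r)
    for _ in range(max_iter):
        d = 1.0 + h @ lam
        g = h / d[:, None]
        grad = g.sum(axis=0)                 # gradient of the objective in lam
        if np.linalg.norm(grad) < tol:
            break
        hess = -(g.T @ g)                    # negative definite Hessian
        step = np.linalg.solve(hess, -grad)  # Newton step
        t = 1.0
        # Halve the step until all implied probabilities remain positive.
        while np.any(1.0 + h @ (lam + t * step) <= 1e-8):
            t /= 2.0
        lam = lam + t * step
    pi = 1.0 / (n * (1.0 + h @ lam))
    return pi, lam

# Illustrative moment contributions whose sample mean is not zero.
rng = np.random.default_rng(0)
h = rng.normal(loc=[0.1, -0.05], scale=1.0, size=(200, 2))
pi, lam = el_probabilities(h)

# One empirical likelihood bootstrap resample: indices drawn with probabilities pi.
idx = rng.choice(len(h), size=len(h), p=pi / pi.sum())
```

At the solution the first-order condition $\sum_i \hat{h}_i / (1 + \lambda'\hat{h}_i) = 0$ holds, which implies both $\sum_i \pi_i \hat{h}_i = 0$ and $\sum_i \pi_i = 1$.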


11.6.5. Nonparametric Regression 


Nonparametric density and regression estimators converge at rate less than $\sqrt{N}$ and
are asymptotically biased. This complicates inference such as confidence intervals (see
Sections 9.3.7 and 9.5.4).

We consider the kernel regression estimator $\hat{m}(x_0)$ of $m(x_0) = \mathrm{E}[y|x = x_0]$ for
observations $(y, x)$ that are iid, though conditional heteroskedasticity is permitted. From
Horowitz (2001, p. 3204), an asymptotically pivotal statistic is

$$
t = \frac{\hat{m}(x_0) - m(x_0)}{s_{\hat{m}(x_0)}},
$$

where $\hat{m}(x_0)$ is an undersmoothed kernel regression estimator, with bandwidth $h$ that
converges to zero faster than the optimal $h^* = O(N^{-1/5})$, and

$$
s^2_{\hat{m}(x_0)} = \frac{1}{[N h \hat{f}(x_0)]^2} \sum_{i=1}^{N} (y_i - \hat{m}(x_i))^2
K\!\left(\frac{x_i - x_0}{h}\right)^2,
$$

where $\hat{f}(x_0)$ is a kernel estimate of the density $f(x)$ at $x = x_0$. A paired bootstrap
resamples $(y^*, x^*)$ and forms $t_b^* = [\hat{m}_b^*(x_0) - \hat{m}(x_0)] / s^*_{\hat{m}(x_0),b}$,
where $s^*_{\hat{m}(x_0),b}$ is computed using bootstrap sample kernel estimates
$\hat{m}_b^*(x_i)$ and $\hat{f}_b^*(x_0)$. The percentile-t confidence interval of Section 11.2.7
then provides an asymptotic refinement. For a symmetrical confidence interval or symmetrical
test at level $\alpha$ the error is $o((Nh)^{-1})$ rather than $O((Nh)^{-1})$ using first-order
asymptotic approximation.

Several variations on this bootstrap are possible. Rather than using undersmoothing, 
bias can be eliminated by directly estimating the bias term given in Section 9.5.2. 
Also rather than using $s^2_{\hat{m}(x_0)}$, the variance term given in Section 9.5.2 can be
directly estimated.

Yatchew (2003) provides considerable detail on implementing the bootstrap in non- 
parametric and semiparametric regression. 




11.6.6. Nonsmooth Estimators 


From Section 11.4.2 the bootstrap assumes smoothness in estimators and statistics. 
Otherwise the bootstrap may not offer an asymptotic refinement and may even be 
invalid. 

As illustration we consider the LAD estimator and extension to binary data. The
LAD estimator (see Section 4.6.2) has objective function $\sum_i |y_i - x_i'\beta|$ that has
discontinuous first derivative. A bootstrap can provide a valid asymptotic approximation
but does not provide an asymptotic refinement. For binary outcomes, the
LAD estimator extends to the maximum score estimator of Manski (1975) (see
Section 14.7.2). For this estimator the bootstrap is not even consistent.

In these examples bootstraps with asymptotic refinements can be obtained by us- 
ing a smoothed version of the original objective function for the estimator. For ex- 
ample, the smoothed maximum score estimator of Horowitz (1992) is presented in 
Section 14.7.2. 


11.6.7. Time Series 


The bootstrap relies on resampling from an iid distribution. Time-series data therefore 
present obvious problems as the result of dependence. 

The bootstrap is straightforward in the linear model with an ARMA error structure
and resampling the underlying white noise error. As an example, suppose
$y_t = \beta x_t + u_t$, where $u_t = \rho u_{t-1} + \varepsilon_t$ and $\varepsilon_t$ is white noise.
Then given estimates $\hat{\beta}$ and $\hat{\rho}$ we can recursively compute residuals as
$\hat{\varepsilon}_t = \hat{u}_t - \hat{\rho}\hat{u}_{t-1} = y_t - x_t\hat{\beta} - \hat{\rho}(y_{t-1} - x_{t-1}\hat{\beta})$.
Bootstrapping these residuals to give $\hat{\varepsilon}_t^*$, $t = 1, \ldots, T$, we can then
recursively compute $\hat{u}_t^* = \hat{\rho}\hat{u}_{t-1}^* + \hat{\varepsilon}_t^*$ and hence
$y_t^* = \hat{\beta}x_t + \hat{u}_t^*$. Then regress $y_t^*$ on $x_t$ with AR(1) error. An early
example was presented by Freedman (1984), who bootstrapped a dynamic linear simultaneous
equations regression model estimated by 2SLS. Given linearity, simultaneity adds few
complications. The dynamic nature of the model is handled by recursively constructing
$y_t^* = f(y_{t-1}^*, x_t, u_t^*)$, where $u_t^*$ are obtained by resampling from the 2SLS
structural equation residuals and $y_0^* = y_0$. Then perform 2SLS on each bootstrap sample.
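
The AR(1) recursion above can be sketched as follows. Several choices here are assumptions made for illustration: the estimator is a simple two-step (Cochrane-Orcutt style) procedure, the regressor is a scalar with no intercept as in the text's example, the residuals are recentered before resampling, and the recursion is started at the first original residual. The function names are hypothetical.

```python
import numpy as np

def fit_ar1_regression(y, x):
    """Two-step estimates of (beta, rho) in y_t = beta x_t + u_t, u_t = rho u_{t-1} + eps_t."""
    b0 = (x @ y) / (x @ x)                             # preliminary OLS
    u = y - b0 * x
    rho = (u[1:] @ u[:-1]) / (u[:-1] @ u[:-1])         # AR(1) coefficient of OLS residuals
    ys, xs = y[1:] - rho * y[:-1], x[1:] - rho * x[:-1]  # quasi-differenced data
    beta = (xs @ ys) / (xs @ xs)
    return beta, rho

def ar1_residual_bootstrap(y, x, B=499, seed=0):
    rng = np.random.default_rng(seed)
    T = len(y)
    beta, rho = fit_ar1_regression(y, x)
    u = y - beta * x
    eps = u[1:] - rho * u[:-1]       # eps_t = u_t - rho u_{t-1}
    eps = eps - eps.mean()           # recenter the white-noise residuals
    betas = np.empty(B)
    for b in range(B):
        e_star = rng.choice(eps, size=T, replace=True)
        u_star = np.empty(T)
        u_star[0] = u[0]             # initial condition held at its sample value (assumption)
        for t in range(1, T):
            u_star[t] = rho * u_star[t - 1] + e_star[t]
        y_star = beta * x + u_star
        betas[b], _ = fit_ar1_regression(y_star, x)    # re-estimate on the bootstrap sample
    return beta, rho, betas.std(ddof=1)

# Simulated data with beta = 1.5 and rho = 0.6 (illustrative values).
rng = np.random.default_rng(3)
T = 200
x = rng.normal(size=T)
e = rng.normal(size=T)
u = np.empty(T)
u[0] = e[0]
for t in range(1, T):
    u[t] = 0.6 * u[t - 1] + e[t]
y = 1.5 * x + u

beta_hat, rho_hat, se_beta = ar1_residual_bootstrap(y, x)
```

Only the white-noise residuals are resampled; the dependence in $u_t$ is rebuilt through the recursion, which is why the method requires a correctly specified error process.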

This method assumes the underlying error is iid. For general dependent data without 
an ARMA specification, for example, nonstationary data, the moving blocks bootstrap 
presented in Section 11.5.2 can be used. 

For testing unit roots or cointegration special care is needed in applying the boot- 
strap as the behavior of the test statistic changes discontinuously at the unit root. 
See, for example, Li and Maddala (1997). Although it is possible to implement a 
valid bootstrap in this situation, to date these bootstraps do not provide an asymptotic 
refinement. 


11.7. Practical Considerations 


The bootstrap without asymptotic refinement can be a very useful tool for the applied 
researcher in situations where it is difficult to perform inference by other means. This 
need can vary with available software and the practitioner’s tool kit. The most common 
application of the bootstrap to date is computation of standard errors needed to conduct 
a Wald hypothesis test. Examples include heteroskedasticity-robust and panel-robust 
inference, inference for two-step estimators, and inference on transformations of es- 
timators. Other potential applications include computation of m-test statistics such as 
the Hausman test.

The bootstrap can additionally provide an asymptotic refinement. Many Monte
Carlo studies show that quite standard procedures can perform poorly in finite sam- 
ples. There appears to be great potential for use of bootstrap refinements, currently 
unrealized. In some cases this could improve existing inference, such as use of the 
wild bootstrap in models with additive errors that are heteroskedastic. In other cases it 
should encourage increased use of methods that are currently under-utilized. In partic- 
ular, model specification tests with good small-sample properties can be implemented 
by bootstrapping easily computed auxiliary regressions. 

There are two barriers to the use of the bootstrap. First, the bootstrap is not always 
built into statistical packages. This will change over time, and for now constructing 
code for a bootstrap is not too difficult provided the package includes looping and the 
ability to save regression output. Second, there are subtleties involved. Asymptotic re- 
finement requires use of an asymptotically pivotal statistic and the simplest bootstraps 
presume iid data and smoothness of estimators and statistics. This covers a wide class 
of applications but not all applications. 


11.8. Bibliographic Notes 


The bootstrap was proposed by Efron (1979) for the iid case. Singh (1981) and Bickel and 
Freedman (1981) provided early theory. A good introductory statistics treatment is by Efron
and Tibshirani (1993), and a more advanced treatment is by Hall (1992). Extensions to
the regression case were considered early on; see, for example, Freedman (1984). Most of 
the work by econometricians has occurred in the past 10 years. The survey of Horowitz 
(2001) is very comprehensive and is well complemented by the survey of Brownstone and 
Kazimi (1998), which considers many econometrics applications, and the paper by MacKinnon 
(2002). 


Exercises 


11-1 Consider the model $y = \alpha + \beta x + \varepsilon$, where $\alpha$, $\beta$, and $x$ are
scalars and $\varepsilon \sim \mathcal{N}[0, \sigma^2]$. Generate a sample of size $N = 20$ with
$\alpha = 2$, $\beta = 1$, and $\sigma^2 = 1$ and suppose that $x \sim \mathcal{N}[2, 2]$. We wish to
test $H_0: \beta = 1$ against $H_a: \beta \neq 1$ at level 0.05 using the t-statistic
$t = (\hat{\beta} - 1)/\mathrm{se}[\hat{\beta}]$. Do as much of the following as your
software permits. Use $B = 499$ bootstrap replications.

(a) Estimate the model by OLS, giving slope estimate $\hat{\beta}$.

(b) Use a paired bootstrap to compute the standard error and compare this to
the original sample estimate. Use the bootstrap standard error to test $H_0$.

(c) Use a paired bootstrap with asymptotic refinement to test $H_0$.

(d) Use a residual bootstrap to compute the standard error and compare this to
the original sample estimate. Use the bootstrap standard error to test $H_0$.

(e) Use a residual bootstrap with asymptotic refinement to test $H_0$.


11-2 Generate a sample of size 20 according to the following dgp. The two regressors
are generated by $x_1 \sim \chi^2(4) - 4$ and $x_2 \sim 3.5 + U[1, 2]$; the error is from a
mixture of normals with $u \sim \mathcal{N}[0, 25]$ with probability 0.3 and
$u \sim \mathcal{N}[0, 5]$ with probability 0.7; and the dependent variable is
$y = 1.3x_1 + 0.7x_2 + 0.5u$.

(a) Estimate by OLS the model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$.

(b) Suppose we are interested in estimating the quantity $\gamma = \beta_1 + \beta_2$ from the
data. Use the least-squares estimates to estimate this quantity. Use the
delta method to obtain approximate standard error for this function.

(c) Then estimate the standard error of $\hat{\gamma}$ using a paired bootstrap. Compare
this to $\mathrm{se}[\hat{\gamma}]$ from part (b) and explain the difference. For the bootstrap
use $B = 25$ and $B = 200$.

(d) Now test $H_0: \gamma = 1.0$ at level 0.05 using a paired bootstrap with $B = 999$.
Perform bootstrap tests without and with asymptotic refinement.


11-3 Use 200 observations from the Section 4.6.4 data on natural logarithm of health
expenditure ($y$) and natural logarithm of total expenditure ($x$). Obtain OLS
estimates of the model $y = \alpha + \beta x + u$. Use the paired bootstrap with $B = 999$.

(a) Obtain a bootstrap estimate of the standard error of $\hat{\beta}$.

(b) Use this standard error estimate to test $H_0: \beta = 1$ against $H_a: \beta \neq 1$.

(c) Do a bootstrap test with refinement of $H_0: \beta = 1$ against $H_a: \beta \neq 1$ under
the assumption that $u$ is homoskedastic.

(d) If $u$ is heteroskedastic what happens to your method in (c)? Is the test still
asymptotically valid, and if so does it offer an asymptotic refinement?

(e) Do a bootstrap to obtain a bias-corrected estimate of $\beta$.


Simulation-Based Methods


12.1. Introduction 


The nonlinear methods presented in the preceding chapters do not require closed-form 
solutions for the estimator. Nonetheless, they rely considerably on analytical tractabil- 
ity. In particular, the objective function for the estimator has been assumed to have a 
closed-form expression, and the asymptotic distribution of the estimator is based on a 
linearization of the estimating equations. 

In the current chapter we present simulation-based estimation methods. The treatment
of ML estimation in Chapter 5 presumed that the density $f(y|x, \theta)$ has a closed-form
expression. If there is no closed-form solution, ML estimation may still be
possible if we instead use a good approximation $\hat{f}(y|x, \theta)$ of $f(y|x, \theta)$ to form
the likelihood function. A common reason for lack of a closed-form expression for the
density is the presence of an intractable expectation in the definition of $f(y|x, \theta)$. For
example, in a random coefficients model it may be difficult to integrate out the ran- 
dom parameters. If the expectation is replaced by a Monte Carlo approximation the 
resulting estimator is called a simulation-based estimator. A similar simulation ap- 
proach can be applied to method of moments estimation based on a moment, such as 
the conditional mean, for which there is no closed-form solution. In the method of 
moments case it can be possible to obtain consistent parameter estimates with much 
less simulation than is necessary for consistency in the ML case. 

These estimation methods are computer intensive because they make extensive use 
of Monte Carlo sampling methods. Their use raises questions of accuracy of approxi- 
mations, efficiency of computation, and the sampling properties of the estimators that 
use such approximations. 

Section 12.2 gives motivating examples for simulation-based estimation. Sec- 
tion 12.3 covers the basics of computing integrals, as an expectation with respect 
to a continuous random variable is an integral. Sections 12.4 and 12.5 present max- 
imum simulated likelihood estimation and simulated moment-based estimation; Sec- 
tion 12.6 deals with indirect inference. These estimators require simulators, detailed 
in Section 12.7, and pseudo-random numbers, detailed in Section 12.8.


12.2. Examples


We consider examples where the conditional density of $y$ given regressors $x$ and
parameters $\theta$ is an integral

$$
f(y|x, \theta) = \int h(y|x, \theta, u)\, g(u)\, du, \tag{12.1}
$$

where the functional forms of $h(\cdot)$ and $g(\cdot)$ are known and $u$ denotes a random
variable, not necessarily an error term, that needs to be integrated out. If there is no
analytical solution for the integral, and hence no closed-form expression for the likelihood
function, then simulation-based estimation methods are warranted.


12.2.1. Random Parameters Models 


A random parameters model or random coefficients model permits regression
parameters to vary across individuals according to some distribution. A fully parametric
random parameters model specifies the dependent variable $y_i$ conditional on regressors
$x_i$ and given parameters $\gamma_i$ to have conditional density $f(y_i|x_i, \gamma_i)$, where
$\gamma_i$ are iid with density $g(\gamma_i|\theta)$. Inference is based on the density of $y_i$
conditional on $x_i$ and given $\theta$,

$$
f(y_i|x_i, \theta) = \int f(y_i|x_i, \gamma_i)\, g(\gamma_i|\theta)\, d\gamma_i. \tag{12.2}
$$

This integral will not have a closed-form solution except in some special cases. A
common specification is to assume normally distributed random parameters, with
$\gamma_i \sim \mathcal{N}[\mu, \Sigma]$. Then $\gamma_i = \mu + \Sigma^{1/2}u_i$, where
$u_i \sim \mathcal{N}[0, I]$ and we can rewrite (12.2) in the form (12.1), where $\theta$ is a vector
containing $\mu$ and the distinct components of $\Sigma$, and $g(u)$ is the
$\mathcal{N}[0, I]$ density.
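
To make the form (12.1) concrete, the sketch below approximates (12.2) by averaging the conditional density over draws of $u$. The model is an assumption picked because it is one of the special cases with a closed form, which allows a check: $y|x, \gamma \sim \mathcal{N}[\gamma x, 1]$ with scalar $\gamma = \mu + \sigma_\gamma u$, so that $y|x \sim \mathcal{N}[\mu x, 1 + \sigma_\gamma^2 x^2]$. Function names and parameter values are hypothetical.

```python
import numpy as np

def normal_pdf(z, var=1.0):
    return np.exp(-z**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def simulated_density(y, x, mu, sigma_g, S=100_000, seed=0):
    """Approximate f(y|x, theta) = integral of f(y|x, gamma) g(gamma|theta) by simulation."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(S)        # u ~ N[0, 1]
    gamma = mu + sigma_g * u          # gamma = mu + Sigma^{1/2} u in the scalar case
    return np.mean(normal_pdf(y - gamma * x))

# Compare with the closed-form marginal density available in this special case.
y0, x0, mu, sigma_g = 1.0, 2.0, 0.5, 0.8
f_sim = simulated_density(y0, x0, mu, sigma_g)
f_exact = normal_pdf(y0 - mu * x0, var=1.0 + sigma_g**2 * x0**2)
```

In models without a closed form only `f_sim` is available, and its accuracy is governed by the number of draws `S`.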

A simple example of a random parameters model is neglected heterogeneity. Then 
often just one parameter, usually the intercept, is assumed to be random and the integral 
is a one-dimensional integral that is easily approximated numerically. More generally, 
however, the dimension of the integral may be high. 

Leading examples of random parameters and unobserved heterogeneity include (1) 
normally distributed random parameters in multinomial logit models (the random pa- 
rameters logit model; see Chapter 15), (2) gamma distributed unobserved heterogene- 
ity in Weibull duration models (see Chapter 19), (3) gamma distributed unobserved 
heterogeneity in Poisson count data models (see Chapter 20), and (4) individual- 
specific random effects in panel data models (see Chapter 21). Closed-form solutions 
for the resulting marginal density after integration over the distribution of heterogene- 
ity are available in example 3 and for the linear model under normality in example 4. 
However, for examples 1 and 2 and many nonlinear applications of example 4
closed-form solutions are not available.


12.2.2. Limited Dependent Variable Models


A limited dependent variable (LDV) is a dependent variable that is observed only 
over part of its range, owing to censoring and truncation. Then the density of the ob- 
served variable involves integrals that may not have a closed-form expression. 

A leading class of LDV models are discrete choice models, detailed in Chapters 
14 and 15. We introduce discrete choice models here because they have been the focus 
of the econometrics literature on simulation-based estimation. 

As an example, consider consumer choice among three mutually exclusive alterna- 
tives, such as among three different durable goods, only one of which is chosen by 
the individual. Suppose the consumer maximizes utility, and let the utilities of alterna- 
tives 1, 2, and 3 be given by U1, U2, and U3, respectively. The utilities U1, U2, and U3 
are not observed. Instead, we observe only a discrete outcome variable y = 1, 2, or 3 
depending on which alternative is chosen. 

Suppose alternative 1 is chosen, because it has the highest utility. Then the
probability mass function is $p_1 = \Pr[y = 1]$, where

$$
\begin{aligned}
p_1 &= \Pr[U_1 - U_2 > 0,\ U_1 - U_3 > 0] \\
    &= \Pr[(x_1 - x_2)'\beta + \varepsilon_1 - \varepsilon_2 > 0,\ (x_1 - x_3)'\beta + \varepsilon_1 - \varepsilon_3 > 0],
\end{aligned}
$$

if we make the common assumption (see Section 15.5.1) that $U_j = x_j'\beta + \varepsilon_j$,
$j = 1, 2, 3$, where the regressor $x$ measures the different attributes of the three goods and
the error $\varepsilon$ can range over $(-\infty, \infty)$. Defining $u_1 = U_1 - U_2$ and
$u_2 = U_1 - U_3$, we have that

$$
p_1 = \int_0^{\infty} \int_0^{\infty} g(u_1, u_2)\, du_1\, du_2, \tag{12.3}
$$

where $g(u_1, u_2)$, or more formally $g(u_1, u_2|x, \theta)$, is the bivariate density of
$(u_1, u_2)$, or equivalently

$$
p_1 = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \mathbf{1}[u_1 > 0, u_2 > 0]\, g(u_1, u_2)\, du_1\, du_2, \tag{12.4}
$$

where $\mathbf{1}[A]$ is the indicator function equal to 1 if event $A$ happens and equal to 0
otherwise.

The integral (12.4) is of the form (12.1). Because the integral is over only part of
the range of $(u_1, u_2)$ (see (12.3)) a closed-form solution may not exist, even though we
know that $\int\!\!\int g(u_1, u_2)\, du_1\, du_2 = 1$ if integration is over the entire range of
$(u_1, u_2)$.
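
Written as in (12.4), $p_1$ is the expectation of an indicator, so it can be approximated by the share of simulated draws for which both utility differences are positive. The sketch below assumes independent standard normal errors and takes the deterministic utilities $x_j'\beta$ as given numbers in `v`; these choices and the function name are illustrative only.

```python
import numpy as np

def choice_prob_frequency(v, S=200_000, seed=0):
    """Share of simulated draws in which alternative 1 has the highest utility."""
    rng = np.random.default_rng(seed)
    v = np.asarray(v, dtype=float)
    eps = rng.standard_normal((S, 3))   # assumed iid N[0, 1] errors
    U = v + eps                         # U_j = v_j + eps_j for j = 1, 2, 3
    u1 = U[:, 0] - U[:, 1]
    u2 = U[:, 0] - U[:, 2]
    # Monte Carlo average of the indicator 1[u1 > 0, u2 > 0].
    return np.mean((u1 > 0) & (u2 > 0))

p_equal = choice_prob_frequency([0.0, 0.0, 0.0])   # symmetric case: true value is 1/3
p_better = choice_prob_frequency([1.0, 0.0, 0.0])  # alternative 1 made more attractive
```

The symmetric case gives a simple check, since with identical deterministic utilities each alternative is chosen with probability one third.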

In particular, if the errors $\varepsilon$ are normally distributed, as in the multinomial
probit model, the integral (12.3) is over the positive orthant of a bivariate normal
distribution. There is no closed-form solution for $p_1$ and hence no tractable expression for
the density $f(y|x, \theta)$ exists. In practice the dimension of the integral can be very high,
making numerical approximation difficult, because for choice among $m$ mutually exclusive
alternatives the integral has dimension $m - 1$. Until simulation-based estimators were
developed researchers either used models with $m < 4$ or chose other error distributions
such as that leading to the much more restricted multinomial logit model.


12.2.3. ML Estimation


For simplicity consider the MLE. Assume independence over observations and that $y$
has conditional density $f(y|x, \theta)$.

The complication in the preceding two examples is that ML estimation is not practical
as there is no closed-form expression for $f(y|x, \theta)$, which is defined by an integral
that does not simplify. Instead, we replace the integral by a numerical approximation
$\hat{f}(y|x, \theta)$, and we maximize

$$
\ln \hat{L}_N(\theta) = \sum_{i=1}^{N} \ln \hat{f}(y_i|x_i, \theta)
$$

with respect to $\theta$. The estimator will be consistent and have the same asymptotic
distribution as the MLE if $\hat{f}(y|x, \theta)$ is a good approximation to $f(y|x, \theta)$.

The resulting first-order conditions are usually nonlinear and are solved by iterative
methods. Because $\hat{f}(y_i|x_i, \theta)$ varies with $i$ and $\theta$, evaluation of the gradient
using numerical derivatives will require at least $Nqr$ evaluations, where $N$ is the sample
size, $q$ is the dimension of $\theta$, and $r$ is the number of iterations. For example, with
1,000 observations, 10 parameters, and 50 iterations there are at least 500,000 function 
evaluations. 

This standard computational demand for nonlinear models now needs to be multiplied
by the number of evaluations needed to compute an adequate approximation
to the integral $f(y|x, \theta)$. Clearly, methods that require relatively few evaluations are
desired.


12.2.4. Bayesian Methods 


Bayesian methods are given a separate treatment in Chapter 13. They involve compu- 
tation of integrals that appear similar to (12.2), but they go one step further and obtain 
the (posterior) distribution of parameters rather than a point estimate such as the MLE. 


12.3. Basics of Computing Integrals 


We consider the integral

$$
I = \int_a^b f(x)\, dx, \tag{12.5}
$$

where $f(\cdot)$ is continuous on $[a, b]$, and the bounds of the integral need not be finite,
so $a = -\infty$ and/or $b = \infty$ are possible. In this section $x$ is initially a scalar and is
used to denote the variable being integrated out. In regression applications integration
is often with respect to a vector that is denoted $u$ since $x$ then denotes the regressors
(see (12.1)). It is assumed that the integral exists, an important qualification that needs
to be checked as approximation methods will yield a finite estimate of $I$ even if the
integral diverges.

We first present numerical integration or quadrature, useful for low-dimensional
integrals. This is followed by Monte Carlo integration, which works better for high- 
dimensional integrals and is the focus of this chapter. 

The material in this section pertains to the implementation phase of simulation- 
based estimation; therefore, some readers may prefer to read it after covering 
Sections 12.4—12.6. 


12.3.1. Deterministic Numerical Integration 


An integral can be interpreted as an area or a volume measure. Deterministic numer- 
ical integration or quadrature replaces the volume by a series of slices of smaller 
volumes that are then added up. Formally this involves evaluating the integrand at sev- 
eral points and taking a weighted sum of these values. The prefix deterministic is used 
to indicate that this method of approximation of an integral does not entail simulation. 


Simpson’s Rule 


By the definition of an integral,

$$
I = \lim \sum_{j=1}^{n} f(x_j)\, \Delta x_j, \tag{12.6}
$$

where $\Delta x_j = x_j - x_{j-1}$, the range $[a, b]$ of $x$ is split at $(n + 1)$ points,
$x_0 < x_1 < \cdots < x_n$, and the limit is taken as $n \to \infty$. Standard approximation
methods are refinements of (12.6) that provide more accurate approximations for finite $n$.
We present results for equally spaced points, though the methods can be generalized to
evaluation at points that are not equally spaced. For simplicity we assume that $f(x)$ can
be evaluated at the limit points $a$ and $b$.

The midpoint rule evaluates at the midpoint $\bar{x}_j = \frac{1}{2}(x_{j-1} + x_j)$ of the
interval $[x_{j-1}, x_j]$ and sums $n$ rectangles that have base $(b - a)/n$ and height
$f(\bar{x}_j)$. Thus $I$ is approximated by

$$
\hat{I}_M = \sum_{j=1}^{n} \frac{b - a}{n} f(\bar{x}_j). \tag{12.7}
$$


The trapezoidal rule is an improvement that draws a straight line between $f(x_{j-1})$ and
$f(x_j)$ and sums $n$ trapezoids that have base $(b - a)/n$ and average height
$(f(x_{j-1}) + f(x_j))/2$. Thus $I$ is approximated by

$$
\hat{I}_T = \sum_{j=1}^{n} \frac{b - a}{n}\, \frac{f(x_{j-1}) + f(x_j)}{2}. \tag{12.8}
$$


Simpson’s rule uses a quadratic curve among three successive points $f(x_{j-1})$, $f(x_j)$,
and $f(x_{j+1})$, whereas the trapezoidal rule used a line between two successive points.
This leads to the approximation

$$
\hat{I}_S = \frac{b - a}{3n} \sum_{j=0}^{n} w_j f(x_j), \tag{12.9}
$$

where $n$ is even, $w_j = 4$ if $j$ is odd, and $w_j = 2$ if $j$ is even, except $w_0 = w_n = 1$.
Further generalization to permit a polynomial of degree $p$ among $p + 1$ successive
points is possible.

Error bounds for these approximations increase as a power function of the range
of integration, $b - a$, and decrease as a power function of the number of intervals.
For Simpson’s rule, $|\hat{I}_S - I| \leq M_4 (b - a)^5 / 180n^4$, where $M_4$ is the maximum
absolute value of the fourth derivative of $f(x)$ on $[a, b]$. For the trapezoidal rule,
$|\hat{I}_T - I| \leq M_2 (b - a)^3 / 12n^2$, where $M_2$ is the maximum absolute value of the
second derivative of $f(x)$ on $[a, b]$. Clearly, the number of intervals needs to increase with
the range of $x$, and one should test for sensitivity to the number of intervals.
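
The three rules (12.7) to (12.9) and their error bounds can be checked numerically. The sketch below uses $\int_0^1 e^x dx = e - 1$ as a test integral, an assumption chosen because every derivative of the integrand is bounded by $e$ on $[0, 1]$, so $M_2 = M_4 = e$. Function names are hypothetical.

```python
import numpy as np

def midpoint(f, a, b, n):
    x = np.linspace(a, b, n + 1)
    mid = 0.5 * (x[:-1] + x[1:])               # midpoints of the n intervals
    return (b - a) / n * np.sum(f(mid))

def trapezoid(f, a, b, n):
    fx = f(np.linspace(a, b, n + 1))
    return (b - a) / n * np.sum(0.5 * (fx[:-1] + fx[1:]))

def simpson(f, a, b, n):
    if n % 2:
        raise ValueError("Simpson's rule needs an even number of intervals")
    x = np.linspace(a, b, n + 1)
    w = np.where(np.arange(n + 1) % 2 == 1, 4.0, 2.0)  # 4 for odd j, 2 for even j
    w[0] = w[-1] = 1.0                                  # end points get weight 1
    return (b - a) / (3 * n) * np.sum(w * f(x))

exact = np.e - 1.0
n = 10
err_mid = abs(midpoint(np.exp, 0.0, 1.0, n) - exact)
err_trap = abs(trapezoid(np.exp, 0.0, 1.0, n) - exact)
err_simp = abs(simpson(np.exp, 0.0, 1.0, n) - exact)

# Error bounds from the text with M_2 = M_4 = e and b - a = 1.
bound_trap = np.e / (12 * n**2)
bound_simp = np.e / (180 * n**4)
```

Rerunning with larger `n` is the sensitivity check recommended above: the trapezoidal error should fall roughly with $n^2$ and the Simpson error with $n^4$.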

Simpson’s rule and related rules can work well for integrals over a bounded
interval, but problems can clearly arise with improper integrals, where the range of
integration is unbounded, because of difficulties in evaluating in the tails. For example,
suppose $[a, b] = [0, \infty)$. Then in choosing $x_n$ there is a trade-off because the upper
bound $x_n$ should be large, but then the distance between evaluation points is large. At the
least one should test for sensitivity to increases in $x_n$.


Gaussian Quadrature 


Gaussian quadrature, where quadrature is an alternative name for numerical
integration, was proposed by Gauss in 1814. It provides a rule for good choice of the
evaluation points $x_j$, no longer equally spaced, and is especially useful for evaluating
improper integrals over an unbounded range.

We first reexpress the integral (12.5) as

$$
I = \int_c^d w(x)\, r(x)\, dx, \tag{12.10}
$$

where $w(x)$ is usually one of the following three functions, depending on the range
of $x$: Gauss-Hermite quadrature sets $w(x) = e^{-x^2}$ and is used for
$[c, d] = (-\infty, \infty)$, Gauss-Laguerre quadrature sets $w(x) = e^{-x}$ and is used when
$[c, d] = (0, \infty)$, and Gauss-Legendre quadrature sets $w(x) = 1$ and is used when
$[c, d] = [-1, 1]$.

In the simplest case (12.10) can be obtained from (12.5) by defining
$r(x) = f(x)/w(x)$. More generally, a transformation of $x$ may be needed so that, for
example, the range $[2, \infty)$ in (12.5) becomes $[0, \infty)$ in (12.10). Some routines permit
the user to simply provide $f(x)$ and the range of integration and automatically take care of
any necessary transformations.

Gaussian quadrature approximates the integral (12.10) by the weighted sum

$$
\hat{I}_G = \sum_{j=1}^{m} w_j\, r(x_j), \tag{12.11}
$$

where the researcher chooses $m$; the $m$ points of evaluation $x_j$ and the weights $w_j$ are
given in books such as Abramowitz and Stegun (1971) or in computer code such as
that provided in Press et al. (1993).

The theory behind the approximation is based on the orthogonal polynomials of
$w(x)$, denoted $p_j(x)$, $j = 0, \ldots, m$, that satisfy

$$
\int_c^d w(x)\, p_j(x)\, p_k(x)\, dx = 0, \quad j \neq k, \quad j, k = 0, \ldots, m.
$$

If additionally $\int_c^d w(x)\, p_j^2(x)\, dx = 1$ then the polynomials are said to be
orthonormal. The approximation (12.11) is exact if $r(x)$ is a polynomial of order $2m - 1$
or less, so the approximation works best if $r(x)$ in (12.10) is well approximated by a
polynomial of order $2m - 1$. A good choice of the number of evaluation points $m$ requires
experimentation, but many applications use $m$ no more than 20 or 30.

As an example consider Gauss-Hermite quadrature, commonly used in econometrics
since integration is often over $(-\infty, \infty)$. For $w(x) = e^{-x^2}$ the orthogonal
polynomials $p_j(x)$ are the Hermite polynomials $H_j(x)$, which in the orthonormal form are
generated using the recursion
$H_{j+1}(x) = \sqrt{2/(j+1)}\, x H_j(x) - \sqrt{j/(j+1)}\, H_{j-1}(x)$,
$j = 0, \ldots, m - 1$, where $H_{-1} = 0$ and $H_0 = \pi^{-1/4}$. The $m$ abscissas $x_j$ are
obtained as the $m$ roots to $H_m(x) = 0$ and, for orthonormal Hermite polynomials, the
weights are $w_j = 1/[m\, H_{m-1}(x_j)^2]$. As already noted $x_j$ and $w_j$ for given $m$ are
readily available in tables or computer code.
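
As one example of such computer code, NumPy's `numpy.polynomial.hermite.hermgauss` returns the abscissas and weights for $w(x) = e^{-x^2}$. The sketch below uses them to compute an expectation under a standard normal density via the change of variable $x = \sqrt{2}\,t$, and separately checks the weight formula against the orthonormal recursion. The helper names `normal_expectation` and `orthonormal_hermite` are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def normal_expectation(h, m=10):
    """E[h(x)] for x ~ N[0, 1] by m-point Gauss-Hermite quadrature."""
    t, w = hermgauss(m)                      # nodes and weights for weight function exp(-t^2)
    return np.sum(w * h(np.sqrt(2.0) * t)) / np.sqrt(np.pi)

def orthonormal_hermite(m, x):
    """Return H_m(x) and H_{m-1}(x) from the orthonormal recursion."""
    H_prev = np.zeros_like(x)                # H_{-1} = 0
    H_cur = np.full_like(x, np.pi ** -0.25)  # H_0 = pi^{-1/4}
    for j in range(m):
        H_next = np.sqrt(2.0 / (j + 1)) * x * H_cur - np.sqrt(j / (j + 1)) * H_prev
        H_prev, H_cur = H_cur, H_next
    return H_cur, H_prev

m = 10
x_nodes, w_nodes = hermgauss(m)
H_m, H_m1 = orthonormal_hermite(m, x_nodes)
w_formula = 1.0 / (m * H_m1**2)              # weights implied by the recursion

second_moment = normal_expectation(lambda z: z**2)   # true value 1
mgf_at_one = normal_expectation(np.exp)              # true value exp(1/2)
```

The second moment is reproduced essentially exactly because the integrand is a low-order polynomial, while the exponential is only approximated, in line with the exactness result for polynomials of order $2m - 1$ or less.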

For integrals over a bounded interval Gauss-Legendre quadrature usually performs better
than Simpson’s rule. The real advantage of Gaussian quadrature, however, is for improper
integrals over an unbounded range. Note that if integration is over $(-\infty, \infty)$ it may
be possible by change of variable techniques to transform to an integral over $(0, \infty)$ and
use Gauss-Laguerre quadrature rather than Gauss-Hermite quadrature.

There are many additional deterministic methods for computing integrals, including 
Laplace approximation (Tierney, Kass, and Kadane, 1989). 


12.3.2. Integration by Direct Monte Carlo Sampling 


Monte Carlo integration provides an alternative to deterministic numerical integration. 
In general the Monte Carlo integral estimate of $I = \int_a^b f(x)\,dx$ is

$$\hat{I}_{MC} = \frac{b - a}{S} \sum_{s=1}^{S} f(x^s), \tag{12.12}$$

where $x^1, \ldots, x^S$ are $S$ uniform draws of $x$ in the range $[a, b]$. Compared to the
midpoint rule we evaluate $f(x)$ at $S$ randomly chosen points rather than $n$ deterministic
midpoints.
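The contrast between deterministic midpoints and random evaluation points can be sketched as follows. The integrand (sin x over [0, π], whose integral is exactly 2), the seed, and the function names are illustrative assumptions. The uniform-draw estimate scales the average of f by the interval length b − a, which is what turns an average of function values into an estimate of the integral.

```python
import math
import random


def midpoint_rule(f, a, b, n):
    """Sum f at the n interval midpoints, each weighted by the interval width."""
    width = (b - a) / n
    return sum(width * f(a + (j - 0.5) * width) for j in range(1, n + 1))


def monte_carlo_uniform(f, a, b, S, rng):
    """Average f at S uniform draws on [a, b] and scale by (b - a)."""
    return (b - a) / S * sum(f(rng.uniform(a, b)) for _ in range(S))


rng = random.Random(12345)  # arbitrary seed, fixed for reproducibility
midpoint_value = midpoint_rule(math.sin, 0.0, math.pi, 20)
monte_carlo_value = monte_carlo_uniform(math.sin, 0.0, math.pi, 100_000, rng)
print(midpoint_value, monte_carlo_value)
```

For a smooth one-dimensional integrand the midpoint rule is typically far more accurate per function evaluation; the Monte Carlo error shrinks only at rate $S^{-1/2}$.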

We focus on regression applications such as those given in Section 12.2. Then 
integration arises because we wish to obtain an expected value E[h(x)], say, where 
the expectation is with respect to a random variable $x$ that has, say, pdf $g(x)$. In the
continuous case we wish to evaluate

$$E[h(x)] = \int_a^b h(x)\, g(x)\, dx, \tag{12.13}$$

where throughout this chapter it is assumed that $E[h(x)] < \infty$, that is, the integral
converges. Then $E[h(x)]$ can be estimated by the direct Monte Carlo integral estimate

$$\hat{I}_{DMC} = \hat{E}[h(x)] = S^{-1} \sum_{s=1}^{S} h(x^s), \tag{12.14}$$

where $\{x^s, s = 1, \ldots, S\}$ is a Monte Carlo sample of $S$ pseudo-random numbers from
the density $g(x)$, obtained using methods given later in Section 12.8. The estimate
(12.14) evaluates $h(x)$ using draws of $x$ from the density $g(x)$, whereas the estimate
(12.12) evaluates $h(x)g(x)$ using uniform draws of $x$. An advantage of
(12.14) is that it can be applied to integrals over unbounded ranges, whereas obtaining
uniform draws in (12.12) is problematic if the limits $a$ or $b$ are unbounded.

The estimate $\hat{E}[h(x)]$ is an average of the function $h(\cdot)$ evaluated at each of the
random draws $x^s$. Equivalently, $\hat{E}[h(x)]$ is an average of the random variable $h(x^s)$,
and its properties as $S \to \infty$ can be obtained if we can apply a law of large numbers
and a central limit theorem. Here $x^s$ is iid, so $h(x^s)$ is iid and we can apply the Kolmogorov
LLN (see Appendix A, Theorem A.8) since the existence of $E[h(x)]$ has already been
assumed. It follows that

$$\hat{E}[h(x)] \xrightarrow{p} E[h(x)] \quad \text{as } S \to \infty.$$

Also, since $h(x^s)$ is iid, the variance of $\hat{E}[h(x)]$ equals $S^{-1}V[h(x)]$, assuming $V[h(x)]$
exists. The approximation is likely to be good for moderate size $S$ if $S^{-1}V[h(x^s)]$ is
small.


12.3.3. Integral Computation Example 


Suppose $x \sim N[0, 1]$, and we wish to compute the mean

$$E[x] = (\sqrt{2\pi})^{-1} \int_{-\infty}^{\infty} x \exp(-x^2/2)\, dx$$

and the moment $E[\exp(-\exp(x))]$, defined as the integral

$$E[\exp(-\exp(x))] = (\sqrt{2\pi})^{-1} \int_{-\infty}^{\infty} \exp(-\exp(x)) \exp(-x^2/2)\, dx.$$

An analytical expression for $E[x]$ exists and yields $E[x] = 0$. By contrast an analytical
solution for $E[\exp(-\exp(x))]$ does not exist. Before seeking a numerical approximation,
we first confirm that the integral does indeed converge. Since $\exp(-\exp(x))$ is
strictly positive and monotonically decreasing with upper bound 1 it follows that
$|\exp(-\exp(x))| < 1$, so $E[\exp(-\exp(x))] < E[1] = 1$ and the integral converges.

These one-dimensional integrals are easily calculated using a deterministic numerical
approximation. For example, consider using the midpoint rule with $n = 20$ equally
spaced evaluations between $x_0 = -5$ and $x_{20} = 5$. Then

$$\hat{E}[x] = (\sqrt{2\pi})^{-1} \sum_{j=1}^{20} \frac{10}{20}\, \bar{x}_j \exp(-\bar{x}_j^2/2),$$

$$\hat{E}[\exp(-\exp(x))] = (\sqrt{2\pi})^{-1} \sum_{j=1}^{20} \frac{10}{20} \exp(-\exp(\bar{x}_j)) \exp(-\bar{x}_j^2/2),$$

where $\bar{x}_j = -5.25 + j/2$. This yields $\hat{E}[x] = 0$ to many decimal places, as expected,
whereas $\hat{E}[\exp(-\exp(x))] = 0.38175656$. The latter estimate changes little, not until
the eighth decimal place, if instead we do $n = 200$ evaluations between $-10$ and $10$.
Clearly deterministic numerical approximations work well here.
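The midpoint-rule calculation for this example can be reproduced with a few lines of code. This is a sketch under the setup stated in the text (n = 20 midpoints on [-5, 5], and n = 200 on [-10, 10]); the function names are hypothetical.

```python
import math


def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def midpoint_expectation(h, lower=-5.0, upper=5.0, n=20):
    """Midpoint-rule approximation of E[h(x)] for x ~ N[0, 1]."""
    width = (upper - lower) / n
    midpoints = [lower + (j - 0.5) * width for j in range(1, n + 1)]
    return sum(width * h(x) * normal_pdf(x) for x in midpoints)


mean_x = midpoint_expectation(lambda x: x)
moment = midpoint_expectation(lambda x: math.exp(-math.exp(x)))
moment_wide = midpoint_expectation(lambda x: math.exp(-math.exp(x)), -10.0, 10.0, 200)
print(mean_x, moment, moment_wide)
```

The midpoints are symmetric about zero, so the approximation of $E[x]$ cancels term by term, and widening the range while keeping the spacing at 0.5 changes the second estimate very little.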

These integrals are also easily calculated using a Monte Carlo approximation, with

$$\hat{E}[x] = \frac{1}{S} \sum_{s=1}^{S} x^s,$$

$$\hat{E}[\exp(-\exp(x))] = \frac{1}{S} \sum_{s=1}^{S} \exp(-\exp(x^s)),$$

where $x^s$ is the $s$th draw of $S$ draws from the $N[0, 1]$ distribution, and a method
to make such draws is given in Appendix B. Table 12.1 gives estimates of $E[x]$
and $E[\exp(-\exp(x))]$ for various numbers of simulations $S$. Observe the tendency
of the estimators to stabilize as $S \to \infty$, and to go to their respective true values of
0 and 0.38175656, where the latter is obtained by deterministic numerical approximation.
However, even with $S = 10^6$ the estimate $\hat{E}[x]$ still differs from zero in the
fourth decimal place. Here $V[\hat{E}[x]] = S^{-1}V[x^s] = 1/S$ since $V[x^s] = 1$, so even with
$S = 10^6$ the standard deviation of $\hat{E}[x]$ is a relatively large 0.001. Alternative methods
that yield a Monte Carlo approximation with lower variance are given in Section 12.7.
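A small simulation along these lines illustrates both the estimates and their standard errors $\sqrt{S^{-1}\hat{V}[h(x)]}$. It is only a sketch: the seed and the set of values of S are arbitrary choices, and the pseudo-random generator is Python's standard library generator rather than the method of Appendix B, so the numbers will not match Table 12.1.

```python
import math
import random


def monte_carlo_mean(h, S, rng):
    """Return the average of h(x^s) over S N[0, 1] draws and its standard error."""
    values = [h(rng.gauss(0.0, 1.0)) for _ in range(S)]
    mean = sum(values) / S
    variance = sum((v - mean) ** 2 for v in values) / (S - 1)
    return mean, math.sqrt(variance / S)


rng = random.Random(2005)  # arbitrary seed, fixed for reproducibility
results = {}
for S in (10, 100, 1_000, 10_000, 100_000):
    results[S] = (
        monte_carlo_mean(lambda x: x, S, rng),
        monte_carlo_mean(lambda x: math.exp(-math.exp(x)), S, rng),
    )
    (mean_x, se_x), (moment, se_moment) = results[S]
    print(f"S={S:>7}  E[x]={mean_x:+.4f} ({se_x:.4f})  "
          f"E[exp(-exp(x))]={moment:.4f} ({se_moment:.4f})")
```

Reporting the standard error alongside each estimate makes the slow $S^{-1/2}$ rate visible: a tenfold reduction in the standard error needs a hundredfold increase in S.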


Table 12.1. Monte Carlo Integration: Example for x Standard Normal

S = Number of simulations    $\hat{E}[x]$    $\hat{E}[\exp(-\exp(x))]$

10                            0.145          0.336
25                           -0.209          0.435
50                            0.050          0.369
100                          -0.120          0.409
500                          -0.059          0.398
1,000                         0.005          0.382
10,000                       -0.007          0.383
100,000                      -0.000          0.382
1,000,000                    -0.000          0.381


12.3.4. Higher Dimensional Integrals


Higher dimensional integrals can be evaluated using either deterministic or Monte 
Carlo integration, with the latter method preferred as the dimension increases. 

Deterministic integration is best done using multivariate Gaussian quadrature or, if 
the limits of integration are not too complicated, by reducing an m-dimensional inte- 
gral to a series of m one-dimensional integrals evaluated using, say, Gaussian quadra- 
ture. However, from the definition of the integral in (12.6) it is clear that the number 
of evaluations will have to go up by the power $m$. For example, if 20 function evaluations
are needed for a one-dimensional integral, then a five-dimensional integral may
require $20^5$, or 3.2 million, function evaluations. Such high precision may not be needed
in an estimation setting where similar computations are being done for each individ- 
ual observation and then summed, but even then the number of evaluations will rise 
substantially with the dimension of the integral. 

Performing Monte Carlo integration in higher dimensions is straightforward: Just
define $x$ in (12.13) and (12.14) to be a vector, and make draws from the multivariate
density $g(x)$. There is apparently no curse of dimensionality. One should bear in mind,
however, that simple Monte Carlo integration will not work if the integrand is strongly 
peaked, and it is possible that such peaks may become more prominent in higher di- 
mensions. In particular, for the discrete choice example in Section 12.2.2 the integrand 
in (12.4) may be nonzero over only a small part of the range of (u, v), a complication 
pursued in Section 12.7. Moreover, drawing from a multivariate distribution can be 
more difficult than drawing from a univariate distribution. 
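The difference in computational burden can be illustrated with a five-dimensional expectation that has a known answer. The integrand below is an assumption chosen for checkability: for $x \sim N[0, I_5]$, $E[\exp(-x'x/2)] = 2^{-5/2}$, since each coordinate contributes $E[\exp(-x_k^2/2)] = 1/\sqrt{2}$. Function names and the seed are hypothetical.

```python
import math
import random


def grid_evaluations(points_per_dimension, dimension):
    """Function evaluations needed by a full product rule."""
    return points_per_dimension ** dimension


def monte_carlo_expectation(h, dimension, S, rng):
    """Direct Monte Carlo estimate of E[h(x)], x ~ N[0, I], with standard error."""
    total = total_sq = 0.0
    for _ in range(S):
        x = [rng.gauss(0.0, 1.0) for _ in range(dimension)]
        value = h(x)
        total += value
        total_sq += value * value
    mean = total / S
    variance = max(total_sq / S - mean * mean, 0.0)
    return mean, math.sqrt(variance / S)


rng = random.Random(314)  # arbitrary seed
exact = 2.0 ** -2.5
estimate, std_error = monte_carlo_expectation(
    lambda x: math.exp(-0.5 * sum(v * v for v in x)), 5, 50_000, rng)
print(grid_evaluations(20, 5))      # evaluations for 20 points per dimension
print(estimate, std_error, exact)
```

Here 50,000 draws already give a standard error below 0.001, against 3.2 million evaluations for a 20-point product rule, although the product rule would be far more accurate for so smooth an integrand.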


12.4. Maximum Simulated Likelihood Estimation


We now consider application of these ideas to ML estimation when no analytical
expression is available for the density. The key result is that simulation can lead to an
estimator with the same distribution as the MLE, provided that the number of simulation
draws made to compute the density for each observation goes to infinity.


12.4.1. Simulators 


Suppose the conditional density $f(y|x, \theta)$ for an observation involves an intractable
integral. Specifically, suppose that, as in (12.1),

$$f(y_i|x_i, \theta) = \int h(y_i|x_i, \theta, u_i)\, g(u_i)\, du_i, \tag{12.15}$$

which needs to be estimated if there is no closed-form solution.

The direct simulator for $f(y_i|x_i, \theta)$ is the obvious Monte Carlo integral estimate

$$\hat{f}(y_i|x_i, u_{iS}, \theta) = \frac{1}{S} \sum_{s=1}^{S} h(y_i|x_i, \theta, u_i^s), \tag{12.16}$$

where $u_{iS}$ is a vector of $S$ draws $u_i^s$, $s = 1, \ldots, S$, that are independent draws from
$g(u_i)$. This simply averages $h(y_i|x_i, \theta, u_i^s)$ over the $S$ draws. From Section 12.3.2, $\hat{f}_i$
is unbiased for $f_i$ and is consistent for $f_i$ as the number of draws $S \to \infty$.

Simulators other than the direct simulator can be used, and these are detailed in
Section 12.7. These can yield an estimate $\hat{f}_i$ that better approximates $f_i$ for a finite number
of draws by, for example, permitting correlation among the draws provided they still
have marginal distribution $g(u_i)$. More generally, then, a simulator for $f(y_i|x_i, \theta)$ is
a Monte Carlo estimate

$$\hat{f}(y_i|x_i, u_{iS}, \theta) = \frac{1}{S} \sum_{s=1}^{S} \tilde{f}(y_i|x_i, \theta, u_i^s), \tag{12.17}$$

where $u_i^s$, $s = 1, \ldots, S$, are $S$ draws with marginal density $g(u_i)$ but not necessarily
independent over $s$. To be useful the simulator should satisfy $\hat{f}_i \xrightarrow{p} f_i$ as
$S \to \infty$. This is likely if the subsimulator $\tilde{f}(\cdot)$ is an unbiased simulator with the
property that

$$E[\tilde{f}(y|x, \theta, u^s)] = f(y|x, \theta). \tag{12.18}$$

A desirable property of a simulator is that $\hat{f}_i$ be differentiable in $\theta$, so that
standard iterative gradient methods can be used to compute the estimate of $\theta$. To eliminate
"chatter" caused by simulation and ensure numerical convergence, the underlying
Monte Carlo draws used to construct $\hat{f}$ should not be redrawn as $\theta$ changes across
iterations.
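The role of fixed draws can be seen in a small sketch of the direct simulator. The model here is a hypothetical one chosen because the true density is known: $h(y|\theta, u)$ is the N[θ + u, 1] density and u is drawn from N[0, 1], so that $f(y|\theta)$ is the N[θ, 2] density. All names and the seed are assumptions for illustration.

```python
import math
import random


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def simulate_density(y, theta, draws):
    """Direct simulator: average of h(y | theta, u^s) = phi(y - theta - u^s)."""
    return sum(normal_pdf(y - theta - u) for u in draws) / len(draws)


rng = random.Random(7)  # arbitrary seed
S = 100
fixed_draws = [rng.gauss(0.0, 1.0) for _ in range(S)]


def with_fixed_draws(theta):
    """Simulated density at y = 0.5 reusing the same S draws for every theta."""
    return simulate_density(0.5, theta, fixed_draws)


def with_new_draws(theta):
    """Simulated density at y = 0.5 using fresh draws at every call."""
    return simulate_density(0.5, theta, [rng.gauss(0.0, 1.0) for _ in range(S)])


step = 1e-6
change_fixed = abs(with_fixed_draws(0.2 + step) - with_fixed_draws(0.2))
change_redrawn = abs(with_new_draws(0.2 + step) - with_new_draws(0.2))
print(change_fixed, change_redrawn)
```

With fixed draws a tiny change in θ produces a tiny change in the simulated density, so numerical derivatives are meaningful. With redrawn values the change is dominated by simulation noise, which is the "chatter" that defeats gradient methods.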


12.4.2. MSL Estimator 


Given independence over $i$, the maximum likelihood estimator $\hat{\theta}_{ML}$ maximizes
$\ln L_N(\theta) = \sum_{i=1}^{N} \ln f(y_i|x_i, \theta)$. The maximum simulated likelihood (MSL)
estimator $\hat{\theta}_{MSL}$ instead maximizes the log-likelihood based on a simulated estimate of the
density, or

$$\ln \hat{L}_N(\theta) = \sum_{i=1}^{N} \ln \hat{f}(y_i|x_i, u_{iS}, \theta), \tag{12.19}$$

where the simulator $\hat{f}(\cdot)$ is defined in (12.17). If $\hat{f}(\cdot)$ is differentiable in $\theta$ then
$\hat{\theta}_{MSL}$ can be computed using the standard gradient methods of Chapter 10, with either
analytical or numerical derivatives used.


12.4.3. Distribution of the MSL Estimator 


From the general consistency proof method outlined in Section 5.3.2, the MSL estimator
will have the same probability limit as the ML estimator if the approximating
objective function $N^{-1} \ln \hat{L}_N(\theta)$ has the same probability limit as the original
objective function $N^{-1} \ln L_N(\theta)$. This occurs if $\ln \hat{f}_i - \ln f_i \xrightarrow{p} 0$, which in
turn happens if $\hat{f}_i - f_i \xrightarrow{p} 0$ as $S \to \infty$.

Even if the MSL estimator is consistent, it is possible that simulation error will
inflate the variance of the MSL estimator compared to the ML estimator. As an example
of a formal statement of conditions under which the MSL estimator is fully efficient
we give the following proposition, which is a rephrasing of a theorem in Gouriéroux
and Monfort (1991).


Proposition 12.1 (Distribution of MSL Estimator) (Gouriéroux and Monfort
1991): Assume the following:


(i) The data are from a simple random sample from a dgp with conditional density
$f(y|x, \theta_0)$ that satisfies the regularity conditions so that the ML estimator is
consistent and asymptotically normal with limit variance matrix $A^{-1}(\theta_0)$, where

$$A(\theta) = -\operatorname{plim} \frac{1}{N} \sum_{i=1}^{N} \frac{\partial^2 \ln f(y_i|x_i, \theta)}{\partial\theta\, \partial\theta'}.$$

(ii) The density $f$ is estimated using the simulator $\hat{f}$ in (12.17) with $\hat{f}$ unbiased
for $f$.


Then the maximum simulated likelihood estimator defined in (12.19) is asymptotically
equivalent to the ML estimator if $S, N \to \infty$ and $\sqrt{N}/S \to 0$, and it has a limit
normal distribution with

$$\sqrt{N}(\hat{\theta}_{MSL} - \theta_0) \xrightarrow{d} N[0, A^{-1}(\theta_0)]. \tag{12.20}$$

The MSL estimator is actually consistent under the weaker condition that $S, N \to \infty$.
This is satisfied if, for example, $S = N^{0.4}/a$ for some constant $a$. However, then
$\sqrt{N}/S = aN^{0.1} \to \infty$, so the MSL estimator is not fully efficient according to
Proposition 12.1. By the usual first-order Taylor series expansion the limit distribution of
$\sqrt{N}(\hat{\theta}_{MSL} - \theta_0)$ is a matrix multiple of
$N^{-1/2} \sum_i \partial \ln \hat{f}_i/\partial\theta|_{\theta_0}$, which depends on
both variability of $\partial \ln f_i/\partial\theta$ and simulation error in the approximation $\hat{f}_i$.
Proposition 12.1 says that for this simulation error to disappear asymptotically the number of
draws $S$ must increase with sample size at rate in excess of $\sqrt{N}$.

The variance matrix of the MSL estimator requires estimation of $A(\theta_0)$. It is easiest
to use a simulated variant of the BHHH estimate defined in Section 5.5.2. Since
$\partial \ln f_i/\partial\theta = (\partial f_i/\partial\theta)/f_i$, the BHHH estimate for the information matrix is

$$\frac{1}{N} \sum_{i=1}^{N} \frac{\partial f_i(\theta)/\partial\theta}{f_i(\theta)} \frac{\partial f_i(\theta)/\partial\theta'}{f_i(\theta)}.$$

Because there is no closed-form solution for $f_i$ and $\partial f_i/\partial\theta$ this expression cannot
be computed. So we replace $f_i$ by the simulator $\hat{f}_i$ defined in (12.17), yielding the
simulated estimate of the asymptotic variance

$$\hat{V}[\hat{\theta}_{MSL}] = \left[ \sum_{i=1}^{N} \frac{\sum_{s=1}^{S} \partial \tilde{f}_i^s(\theta)/\partial\theta \;\; \sum_{s=1}^{S} \partial \tilde{f}_i^s(\theta)/\partial\theta'}{\left(\sum_{s=1}^{S} \tilde{f}_i^s(\theta)\right) \left(\sum_{s=1}^{S} \tilde{f}_i^s(\theta)\right)} \right]^{-1}, \tag{12.21}$$

where $\tilde{f}_i^s(\theta) = \tilde{f}(y_i|x_i, u_i^s, \theta)$ and all terms are evaluated at $\hat{\theta}_{MSL}$.
Alternative estimates of the variance matrix can be
obtained by similar adaptation of the Hessian estimate and sandwich estimates defined
in Section 5.5.2.


An important practical issue concerns the number of simulations. One can increase
the number of simulations as the sample size increases, but the level or the absolute
value of $S$ remains indeterminate. If there is little difference in the estimates using
2,400 simulations, say, rather than 2,600, then we might take this as an indication that
2,400 simulations is an adequate number. Suppose now that the sample increases
fourfold. By how much should we increase the number of simulations? Proposition 12.1
suggests that we should more than double $S$ to more than 4,800, so that the ratio $\sqrt{N}/S$
decreases toward zero. However, notice that in this case we may not be sure if $\sqrt{N}/S$,
here equal to 1/30 if $S = 2{,}400$ and $N = 6{,}400$, say, is sufficiently close to zero. So
the question of whether one has done enough simulations is difficult to answer. Many
practitioners rely on rough indicators of convergence of point estimates, informally
based on checking the gradients of $\hat{L}_N(\theta)$. A formal test-based approach to choosing
$S$ is discussed in Hajivassiliou (2000).
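The arithmetic behind this discussion is easy to check. In the sketch below the helper names are hypothetical, and the initial sample size N = 1,600 is inferred from the statement that a fourfold increase gives N = 6,400.

```python
import math


def simulation_ratio(N, S):
    """The ratio sqrt(N)/S that Proposition 12.1 requires to go to zero."""
    return math.sqrt(N) / S


def draws_to_hold_ratio(S_old, N_old, N_new):
    """Number of draws that keeps sqrt(N)/S unchanged when N changes."""
    return S_old * math.sqrt(N_new / N_old)


print(simulation_ratio(1_600, 2_400))            # ratio before the sample grows
print(simulation_ratio(6_400, 2_400))            # ratio if S is left at 2,400
print(draws_to_hold_ratio(2_400, 1_600, 6_400))  # draws needed just to hold the ratio

# With S = N^0.4 / a the ratio equals a * N^0.1, which grows with N.
for N in (10 ** 3, 10 ** 6):
    print(N, simulation_ratio(N, N ** 0.4))
```

Holding the ratio fixed only requires S to grow like $\sqrt{N}$; driving it toward zero, as the proposition requires, needs S to grow faster than that.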


12.4.4. Asymptotic Bias-Adjusted MSL 


The MSL estimator is inconsistent, or asymptotically biased, when the number of
simulations $S < \infty$. This bias arises for finite $S$ because $\ln \hat{f}_i$ is biased for $\ln f_i$ even if
the simulator $\hat{f}_i$ is unbiased for $f_i$, as the consequence of taking the natural logarithm.
Thus $N^{-1} \ln \hat{L}_N(\theta)$ and $N^{-1} \ln L_N(\theta)$ have different probability limits for finite $S$. This
motivates a search for alternative simulation-based estimators, since we can never set
$S = \infty$ and it may be computationally expensive to set $S$ to be large.

The obvious approach is to find an unbiased simulator for the log-density $\ln f_i$,
rather than for $f_i$, but in practice this is not possible. Instead, in this section we present
a bias-corrected version of MSL, and in the following section we present an alternative,
less efficient estimator than MSL that is consistent for finite $S$.

Gouriéroux and Monfort (1991) give an expression for the bias of the MSL estimator.
The inconsistency of the MSL estimator for fixed $S$ comes from the fact that then
$\ln \hat{f}$ is an inconsistent estimator of $\ln f$. A way of reducing the inconsistency is to use
a bias-adjusted log-likelihood function. Write

$$\ln \hat{f} = \ln[f + (\hat{f} - f)].$$

Taking a second-order Taylor expansion around $\ln f$ yields

$$\ln \hat{f} \simeq \ln f + \frac{\hat{f} - f}{f} - \frac{(\hat{f} - f)^2}{2f^2}.$$

Integrating with respect to the density of $u$, and solving for $\ln f$, yields

$$\ln f \simeq E_u[\ln \hat{f}] + \frac{1}{2} \frac{E_u[(\hat{f} - f)^2]}{f^2}, \tag{12.22}$$

assuming $\hat{f}$ is an unbiased simulator so that $E_u[\hat{f}] = f$. This expression makes it clear
that a simulator $\hat{f}$ with small variance leads to lower bias.

A bias-corrected estimator uses an adjusted log-likelihood based on the right-hand
side of (12.22). For the simulator (12.17), $\hat{f}$ equals $S^{-1} \sum_s \tilde{f}^s$ and $E_u[(\hat{f} - f)^2]$
equals $S^{-2} \sum_s E_u[(\tilde{f}^s - f)^2]$. Given draws independent over $s$ the latter can be
approximated by $S^{-2} \sum_s (\tilde{f}^s - \hat{f})^2$. Then (12.22) yields the first-order asymptotic
bias-corrected MSL estimator, $\hat{\theta}_{BCMSL}$, which maximizes

$$\ln \hat{L}_{B,N}(\theta) = \sum_{i=1}^{N} \left\{ \ln \hat{f}(y_i|x_i, u_{iS}, \theta) + \frac{1}{2} \frac{S^{-2} \sum_{s=1}^{S} [\tilde{f}(y_i|x_i, u_i^s, \theta) - \hat{f}(y_i|x_i, u_{iS}, \theta)]^2}{[\hat{f}(y_i|x_i, u_{iS}, \theta)]^2} \right\},$$

where $\hat{f}(y_i|x_i, u_{iS}, \theta) = S^{-1} \sum_s \tilde{f}(y_i|x_i, u_i^s, \theta)$. The usefulness of this bias-reduction
technique will vary from case to case, as the assumption that bias is small may not
always hold.
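The effect of the second-order correction can be examined by simulation in a case where the true density is known. The sketch below is illustrative only and rests on an assumed model: the subsimulator is the N[θ + u, 1] density with u drawn from N[0, 1], so the true density is N[θ, 2]. It compares the average of $\ln \hat{f}$ with the average of the corrected quantity over many replications with S = 10 draws each. Names, seed, and replication count are assumptions.

```python
import math
import random


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def log_density_estimates(y, theta, S, rng):
    """Return (ln f_hat, bias-corrected ln f_hat) computed from S fresh draws."""
    sub = [normal_pdf(y - theta - rng.gauss(0.0, 1.0)) for _ in range(S)]
    f_hat = sum(sub) / S
    # S^{-2} sum_s (f_tilde^s - f_hat)^2 approximates E_u[(f_hat - f)^2].
    spread = sum((f - f_hat) ** 2 for f in sub) / S ** 2
    raw = math.log(f_hat)
    return raw, raw + 0.5 * spread / f_hat ** 2


rng = random.Random(42)  # arbitrary seed
y, theta, S, replications = 0.0, 0.0, 10, 40_000
true_log_density = math.log(math.exp(-(y - theta) ** 2 / 4.0) / math.sqrt(4.0 * math.pi))

raw_sum = corrected_sum = 0.0
for _ in range(replications):
    raw, corrected = log_density_estimates(y, theta, S, rng)
    raw_sum += raw
    corrected_sum += corrected

raw_bias = raw_sum / replications - true_log_density
corrected_bias = corrected_sum / replications - true_log_density
print(raw_bias, corrected_bias)
```

In this setup the uncorrected average lies below the true log-density, as concavity of the logarithm implies, and the correction removes most of the gap. How well this carries over to other models depends on how accurate the second-order expansion is.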


12.4.5. Unobserved Heterogeneity Example 


Suppose that $y_i \sim N[\theta_i, 1]$, where the scalar parameter $\theta_i$ varies across individuals
with $\theta_i = \theta + u_i$, with $u_i$ representing unobserved heterogeneity that is assumed to
have a known distribution. The density of $y$ conditional on $u$ is simply

$$f(y|u, \theta) = \frac{1}{\sqrt{2\pi}} \exp\{-(y - \theta - u)^2/2\}. \tag{12.23}$$

However, inference on $\theta$ needs to be based on the marginal density of $y$ (i.e., marginal
with respect to $u$), which requires integrating out $u$. Here we assume that $u$ has density

$$g(u) = e^{-u} \exp(-e^{-u}), \tag{12.24}$$

a skewed distribution that has nonzero mean and for simplicity does not depend on
unknown parameters.

Maximum likelihood estimation is not possible as the marginal density $f(y|\theta)$,
which equals $\int f(y|\theta, u)\, g(u)\, du$, has no closed-form solution. We instead use the MSL
estimator using the direct simulator in (12.16), so that $\hat{\theta}_{MSL}$ maximizes

$$\ln \hat{L}_N(\theta) = \sum_{i=1}^{N} \ln \left( \frac{1}{S} \sum_{s=1}^{S} \frac{1}{\sqrt{2\pi}} \exp\{-(y_i - \theta - u_i^s)^2/2\} \right), \tag{12.25}$$

where $u_i^s$, $s = 1, \ldots, S$, are draws from the extreme value density $g(u_i)$ in (12.24).

The MSL estimator $\hat{\theta}_{MSL}$ is the solution to the first-order conditions

$$\sum_{i=1}^{N} \frac{\sum_{s=1}^{S} (y_i - \theta - u_i^s) \exp\{-(y_i - \theta - u_i^s)^2/2\}}{\sum_{s=1}^{S} \exp\{-(y_i - \theta - u_i^s)^2/2\}} = 0, \tag{12.26}$$

upon some simplification. There is no closed-form solution for $\hat{\theta}$, but standard iterative
methods can be used to compute $\hat{\theta}_{MSL}$.

Consistency of the MSL estimator requires the number of draws $S \to \infty$, in addition
to the usual sample size $N \to \infty$, so the method is potentially computationally
intensive. The MSL estimator is then asymptotically normally distributed as usual,
with asymptotic variance most easily estimated using the BHHH estimator (12.21),
which yields

$$\hat{V}[\hat{\theta}_{MSL}] = \left[ \sum_{i=1}^{N} \left( \frac{\sum_{s=1}^{S} (y_i - \hat{\theta}_{MSL} - u_i^s) \exp\{-(y_i - \hat{\theta}_{MSL} - u_i^s)^2/2\}}{\sum_{s=1}^{S} \exp\{-(y_i - \hat{\theta}_{MSL} - u_i^s)^2/2\}} \right)^2 \right]^{-1}. \tag{12.27}$$

This estimator is fully efficient.


Table 12.2. Maximum Simulated Likelihood Estimation: Example

Number of Simulations          S=1        S=10       S=100      S=1,000    S=10,000

MSL estimate $\hat{\theta}$    1.0416     1.0594     1.1775     1.1845     1.1828
Standard error                 (.0968)    (.1093)    (.1453)    (.1448)    (.0091)
$\ln \hat{L}(\hat{\theta})$    -136.31    -174.38    -190.44    -192.43    -192.35

To illustrate we consider a sample $\{y_1, \ldots, y_{100}\}$ of size $N = 100$ generated from
the model of (12.23) and (12.24) with $\theta = 1$. Table 12.2 gives estimates as the number
of draws $S$ increases. For small $S$ the MSL estimator is inconsistent. By $S = 10{,}000$
the estimator $\hat{\theta}_{MSL}$ has stabilized, though the estimated standard error bounces around
quite a bit. The simulated log-likelihood decreases as $S$ increases but eventually
stabilizes. Some dependence on $S$ is expected, as the simulator is unbiased for $f(y|\theta)$ but is
biased for $\ln f(y|\theta)$, since by Jensen's inequality $\ln E[\hat{f}(y|\theta)] > E[\ln \hat{f}(y|\theta)]$ because
the natural logarithm function is globally concave; see Appendix A (Section A.8).
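A minimal sketch of the MSL computation for this example follows. It generates its own data, so the estimates will not match Table 12.2. The extreme value draws are obtained by inverting the cdf $\exp(-e^{-u})$, the maximization uses a simple golden-section search rather than the gradient methods of Chapter 10, and the function names, seed, and search bracket are all assumptions made for illustration. The draws are held fixed while θ varies.

```python
import math
import random


def gumbel_draw(rng):
    """Draw from g(u) = exp(-u) exp(-exp(-u)) by inverting its cdf exp(-exp(-u))."""
    p = rng.random()
    while p <= 0.0:
        p = rng.random()
    return -math.log(-math.log(p))


def simulated_loglik(theta, y, draws):
    """Simulated log-likelihood (12.25); draws[i] holds the S draws for observation i."""
    total = 0.0
    for y_i, u_i in zip(y, draws):
        kernel = sum(math.exp(-0.5 * (y_i - theta - u) ** 2) for u in u_i)
        total += math.log(kernel / (len(u_i) * math.sqrt(2.0 * math.pi)))
    return total


def golden_section_max(f, lo, hi, tol=1e-6):
    """Maximize a unimodal function on [lo, hi] by golden-section search."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = hi - ratio * (hi - lo), lo + ratio * (hi - lo)
    fc, fd = f(c), f(d)
    while hi - lo > tol:
        if fc > fd:
            hi, d, fd = d, c, fc
            c = hi - ratio * (hi - lo)
            fc = f(c)
        else:
            lo, c, fc = c, d, fd
            d = lo + ratio * (hi - lo)
            fd = f(d)
    return 0.5 * (lo + hi)


rng = random.Random(2005)  # arbitrary seed
N, theta_true = 100, 1.0
y = [theta_true + gumbel_draw(rng) + rng.gauss(0.0, 1.0) for _ in range(N)]
y_bar = sum(y) / N

estimates, draws_by_S = {}, {}
for S in (1, 10, 100):
    draws = [[gumbel_draw(rng) for _ in range(S)] for _ in range(N)]  # fixed across theta
    draws_by_S[S] = draws
    estimates[S] = golden_section_max(
        lambda t: simulated_loglik(t, y, draws), y_bar - 5.0, y_bar + 5.0)
    print(S, round(estimates[S], 4), round(simulated_loglik(estimates[S], y, draws), 2))
```

With S = 1 the simulated log-likelihood is quadratic in θ, so the maximizer equals the sample mean of $y_i - u_i^1$, which gives a convenient check on the optimizer.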


12.5. Moment-Based Simulation Estimation


The simulation approach to estimation when there is no closed-form expression for
the objective function can be extended to estimators other than the MLE. Furthermore,
in some cases it is possible to obtain consistent parameter estimates with only a few
simulations per observation, though there is then an efficiency loss.


12.5.1. Simulated m-Estimators 


Consider an m-estimator that has as its objective function (see Section 5.2.2)

$$Q_N(\theta) = \frac{1}{N} \sum_{i=1}^{N} q(y_i, x_i, \theta).$$

Maximum likelihood is the special case $q(y, x, \theta) = \ln f(y|x, \theta)$.

Suppose there is no closed-form expression for $q(\cdot)$, but a simulated estimate is
available. Then a simulated m-estimator maximizes

$$\hat{Q}_N(\theta) = \frac{1}{N} \sum_{i=1}^{N} \hat{q}(y_i, x_i, u_{iS}, \theta), \tag{12.28}$$

where, similar to Section 12.4.1, $\hat{q}_i$ is an estimate of $q_i$ based on a vector $u_{iS}$
of $S$ draws $u_i^s$, $s = 1, \ldots, S$, from an appropriate distribution. Usually,
$\hat{q}(\cdot) = S^{-1} \sum_s \tilde{q}(y_i|x_i, \theta, u_i^s)$, where $u_i^s$ is the $s$th draw.

The simulated m-estimator will be consistent if the m-estimator is consistent and
additionally

$$\operatorname{plim} \hat{Q}_N(\theta) = \operatorname{plim} Q_N(\theta), \tag{12.29}$$

since from Section 5.3 the necessary condition for consistency of the original
m-estimator is that $\operatorname{plim} Q_N(\theta)$ is maximized at $\theta = \theta_0$. Here the first plim is with
respect to all stochastic variables, including the simulated draws $u_{iS}$, whereas the second
plim does not depend on $u_{iS}$.

Condition (12.29) is satisfied if the simulator is such that $\hat{q}_i - q_i \xrightarrow{p} 0$ as $S \to \infty$,
since then $N^{-1} \sum_i \hat{q}_i - N^{-1} \sum_i q_i \xrightarrow{p} 0$. This was the assumption made in
Section 12.4. Furthermore, the simulated m-estimator should have the same limit
distribution as the m-estimator if, as in Section 12.4, $S$ increases with sample size so that
$\sqrt{N}/S \to 0$. This requires many simulations.


12.5.2. Reducing the Number of Simulations 


Now suppose the simulator $\hat{q}_i$ is not only consistent but is unbiased. Then by application
of a law of large numbers, and for simplicity suppressing stochastic variables other
than the simulated draws, $\operatorname{plim} \hat{Q}_N(\theta) = \lim N^{-1} \sum_i E_{u_{iS}}[\hat{q}_i] = \lim N^{-1} \sum_i q_i =
\operatorname{plim} Q_N(\theta)$ and condition (12.29) is satisfied. Thus the simulated m-estimator is
consistent with as little as one draw of $u_i$ per observation, provided $E_{u_{iS}}[\hat{q}_i] = q_i$.

Unfortunately, this result is difficult to implement, as in applications it is rarely
possible to find an unbiased simulator for $q_i$. For example, with ML estimation it can
be possible to find an unbiased simulator for the density f;, but it is not possible to 
find an unbiased simulator for In f;. Similarly, for NLS estimation it can be possible 
to find an unbiased estimator for the conditional mean, but it is not possible to find an 
unbiased simulator for the squared error, which involves the square of the conditional 
mean. 
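The NLS case can be illustrated numerically. In the sketch below the conditional mean μ is simulated without bias by averaging S draws from a hypothetical N[μ, 1] distribution, an assumption made purely so that the answer is known: the squared error built from that simulator then has expectation $(y - \mu)^2 + 1/S$ rather than $(y - \mu)^2$. Names, seed, and parameter values are illustrative.

```python
import random


def simulated_mean(mu, S, rng):
    """Unbiased simulator of a conditional mean mu: average of S draws from N[mu, 1]."""
    return sum(rng.gauss(mu, 1.0) for _ in range(S)) / S


rng = random.Random(99)  # arbitrary seed
mu, y, S, replications = 2.0, 3.0, 1, 100_000

average_simulator = 0.0
average_squared_error = 0.0
for _ in range(replications):
    m_hat = simulated_mean(mu, S, rng)
    average_simulator += m_hat / replications
    average_squared_error += (y - m_hat) ** 2 / replications

true_squared_error = (y - mu) ** 2
print(average_simulator, average_squared_error, true_squared_error)
```

The simulator averages to μ, but the simulated squared error averages to roughly 2 when the true squared error is 1. The excess is the simulation variance 1/S, which does not vanish unless S grows.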

In some cases this result can be implemented, however, if the estimator is a method 
of moments or GMM estimator rather than an m-estimator. 


12.5.3. Method of Simulated Moments


Suppose theory leads to a conditional moment condition

$$E[m(y_i, x_i, \theta_0)|x_i] = 0, \tag{12.30}$$

where $m(\cdot)$ is a scalar for simplicity. Let $w_i$ denote instruments, a function of $x_i$ and
possibly $\theta_0$, that satisfy

$$E[w_i\, m(y_i, x_i, \theta_0)] = 0. \tag{12.31}$$

The method of moments estimator $\hat{\theta}_{MM}$ (see Section 6.3.1) minimizes

$$Q_N(\theta) = \left[ \frac{1}{N} \sum_{i=1}^{N} w_i\, m(y_i, x_i, \theta) \right]' \left[ \frac{1}{N} \sum_{i=1}^{N} w_i\, m(y_i, x_i, \theta) \right], \tag{12.32}$$

where for simplicity the just-identified case that $\dim[w_i] = \dim[\theta]$ is assumed. Results
do generalize to the overidentified case, but the notation is more cumbersome as a
weighting matrix then needs to be introduced and estimation is by GMM.

The method of moments estimator is consistent and has limit normal distribution
with variance matrix that depends in part on the choice of instruments $w_i$. An example
is nonlinear regression, where $m(y, x, \theta) = y - E[y|x]$ is the error term and the
conditional mean $E[y|x]$ is a specified function of $x$ and $\theta$. Then the best choice of
instrument is $w = \partial E[y|x]/\partial\theta|_{\theta_0}$ if the error is homoskedastic, since then the method of
moments estimator has the same first-order conditions as those for the NLS estimator.

Now suppose there is no closed-form expression for $m(y, x, \theta)$. For example, a
nonlinear regression model may lack a closed-form expression for the conditional mean.
Instead, $m(y, x, \theta)$ is an integral

$$m(y_i, x_i, \theta) = \int h(y_i, x_i, u_i, \theta)\, g(u_i)\, du_i, \tag{12.33}$$

for some functions $h(\cdot)$ and $g(\cdot)$, that has no closed-form solution. Obtaining a method
of moments estimator is no longer feasible.

The method of simulated moments (MSM) estimator $\hat{\theta}_{MSM}$ instead minimizes

$$\hat{Q}_N(\theta) = \left[ \frac{1}{N} \sum_{i=1}^{N} w_i\, \hat{m}(y_i, x_i, u_{iS}, \theta) \right]' \left[ \frac{1}{N} \sum_{i=1}^{N} w_i\, \hat{m}(y_i, x_i, u_{iS}, \theta) \right], \tag{12.34}$$

where $\hat{m}(y_i, x_i, u_{iS}, \theta)$ is an unbiased simulator for $m(y_i, x_i, \theta)$ that satisfies the
condition

$$E[\hat{m}_i(y_i, x_i, u_{iS}, \theta)] = m_i(y_i, x_i, \theta), \tag{12.35}$$

and $u_{iS}$ denotes $S$ draws from the marginal density $g(u_i)$ and $S \geq 1$. Examples of $m_i$
and unbiased simulator $\hat{m}_i$ are given in the following.


12.5.4. Distribution of MSM Estimator 


The MSM estimator was proposed by McFadden (1989), who proved the following 
properties for the estimator. 


Proposition 12.2 (Distribution of MSM Estimator) (McFadden 1989): Assume
the following:


(i) The data are from a simple random sample from a dgp, where $m(y, x, \theta_0)$ has zero
conditional expectation as in (12.30) and $w_i m(y, x, \theta_0)$ has zero unconditional
expectation as in (12.31) and assumptions are satisfied so that the MM estimator
that minimizes (12.32) is consistent and asymptotically normal.


(ii) The function $m(y, x, \theta_0)$ is defined by (12.33) and is estimated using the unbiased
simulator $\hat{m}(y, x, \theta_0)$ that satisfies (12.35).


Then with $S$ fixed the method of simulated moments estimator that minimizes
(12.34) is consistent and asymptotically normal as $N \to \infty$ and has a limit normal
distribution with

$$\sqrt{N}(\hat{\theta}_{MSM} - \theta_0) \xrightarrow{d} N[0, A^{-1}(\theta_0)\, B(\theta_0)\, A^{-1}(\theta_0)'], \tag{12.36}$$

where

$$A(\theta_0) = \operatorname{plim} \frac{1}{N} \sum_{i=1}^{N} w_i \left. \frac{\partial m_i(\theta)}{\partial\theta'} \right|_{\theta_0} \tag{12.37}$$

and

$$B(\theta_0) = \operatorname{plim} \frac{1}{N} \sum_{i=1}^{N} w_i\, V[\hat{m}_i(\theta_0)]\, w_i', \tag{12.38}$$

with the variance $V[\cdot]$ being with respect to both the conditional distribution of $y_i$
given $x_i$ and the draws $u_{iS}$ given after (12.35).


Before giving a derivation for this proposition we note the following. First, the
MSM estimator has the remarkable property of being consistent even if $S = 1$. Second,
there is an efficiency loss for finite $S$. The variance matrix for $\hat{\theta}_{MSM}$ is the same as that
for $\hat{\theta}_{MM}$, except that for MM estimation $V[\hat{m}_i]$ in (12.38) is replaced by the smaller
$V[m_i]$. Third, the efficiency loss caused by simulation disappears as $S \to \infty$, since
then $V[\hat{m}_i] = V[m_i]$. Fourth, as for MM estimation, the MSM estimator with $S \to \infty$
may be inefficient compared to other estimators if the instruments $w$ are poorly chosen.

Consistency of the MSM estimator requires that condition (12.29) is satisfied for
$\hat{Q}_N(\theta)$ and $Q_N(\theta)$ given in (12.34) and (12.32). By a law of large numbers

$$\operatorname{plim} \frac{1}{N} \sum_{i=1}^{N} w_i \hat{m}_i = \operatorname{plim} \frac{1}{N} \sum_{i=1}^{N} w_i E_{u_i}[\hat{m}_i],$$

where the first plim is with respect to all stochastic variables whereas the second
plim is with respect to all stochastic variables aside from the simulated draws $u$. Here
$E_{u_i}[\hat{m}_i] = m_i$ since $\hat{m}_i$ is an unbiased simulator, so

$$\operatorname{plim} \frac{1}{N} \sum_{i=1}^{N} w_i \hat{m}_i = \operatorname{plim} \frac{1}{N} \sum_{i=1}^{N} w_i m_i.$$

This in turn implies that $\operatorname{plim} \hat{Q}_N(\theta) = \operatorname{plim} Q_N(\theta)$. So $\hat{\theta}_{MSM}$ is consistent, provided
$\theta_0$ minimizes $\operatorname{plim} Q_N(\theta)$, which is necessary for the original MM estimator to be
consistent.
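
For the limit distribution, differentiating $\hat{Q}_N(\theta)$ with respect to $\theta$ yields

$$\left[ \frac{1}{N} \sum_{i=1}^{N} w_i \frac{\partial \hat{m}_i(\theta)}{\partial\theta'} \right]' \left[ \frac{1}{N} \sum_{i=1}^{N} w_i \hat{m}_i(\theta) \right] = 0.$$

The first matrix is a full-rank square matrix, so equivalently $\hat{\theta}_{MSM}$ satisfies the
first-order conditions

$$\frac{1}{N} \sum_{i=1}^{N} w_i \hat{m}_i(\theta) = 0,$$

where $\hat{m}_i(\theta) = \hat{m}_i(y_i, x_i, u_{iS}, \theta)$. By the usual exact first-order Taylor series
expansion about $\theta_0$

$$\frac{1}{N} \sum_{i=1}^{N} w_i \hat{m}_i(\theta_0) + \frac{1}{N} \sum_{i=1}^{N} w_i \left. \frac{\partial \hat{m}_i(\theta)}{\partial\theta'} \right|_{\theta^+} (\hat{\theta} - \theta_0) = 0,$$

where $\theta^+$ lies between $\hat{\theta}$ and $\theta_0$, and hence

$$\sqrt{N}(\hat{\theta} - \theta_0) = -\left[ \frac{1}{N} \sum_{i=1}^{N} w_i \left. \frac{\partial \hat{m}_i(\theta)}{\partial\theta'} \right|_{\theta^+} \right]^{-1} \frac{1}{\sqrt{N}} \sum_{i=1}^{N} w_i \hat{m}_i(\theta_0).$$

Now $E_u[\partial \hat{m}(\theta)/\partial\theta] = \partial E_u[\hat{m}(\theta)]/\partial\theta = \partial m(\theta)/\partial\theta$, so the bracketed matrix on the
right-hand side converges to $A(\theta_0)$ given in Proposition 12.2. The second term on the
right-hand side has a limit normal distribution with mean zero and variance matrix

$$B(\theta_0) = \operatorname{plim} \frac{1}{N} \sum_{i=1}^{N} w_i\, V[\hat{m}_i(\theta_0)]\, w_i',$$

as in Proposition 12.2, where $V[\hat{m}_i(\theta_0)]$ is a variance with respect to both $u_{iS}$ and the
distribution of $y_i$ given $x_i$.

Since $u_{iS}$ is independent of $y_i$ we have

$$V_{y,u}[\hat{m}(\theta_0)] = V_y[E_u[\hat{m}(\theta_0)]] + E_y[V_u[\hat{m}(\theta_0)]] = V_y[m(\theta_0)] + E_y[V_u[\hat{m}(\theta_0)]].$$

Substitution yields a more detailed definition of $B(\theta_0)$ given in Proposition 12.2.

Simulation inflates the variance of the MSM estimator because of the term
$E_y[V_u[\hat{m}(\theta_0)]]$, which goes to zero as $S \to \infty$. In the special case that the simulator
is the frequency simulator, it can be shown that $V_{y,u}[\hat{m}(\theta_0)] = (1 + 1/S)\, V_y[m(\theta_0)]$,
so that the effect of simulation using the frequency simulator is to inflate the variance
of the MM estimator by $(1 + (1/S))$.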


12.5.5. Choosing between MSM and MSL 


The practitioner will weigh the pros and cons of MSL versus MSM. Given that MSM
is consistent for small $S$, and further given the difficulty of ensuring that one has set $S$
at a large enough value to ensure a good approximation to the MLE, why would MSL
ever be preferred to MSM?

First, observe that MSL is in principle straightforward and simple to implement.
Given the parametric assumptions, the optimal weighting of observations is inherent
to the MLE method. The MSM, analogous to the GMM, in contrast requires us to work
with products of weight (or instrumental variable) functions and residuals, and these
components may be correlated. The numerical instability of the GMM estimator (without
simulation) has been documented by, for example, Altonji and Segal (1996) (see
Section 6.3.5). Similarly, Geweke, Keane, and Runkle (1997) and McFadden and Ruud
(1994) have provided evidence of the instability of the MSM estimator. Nevertheless,
although simplicity favors MSL, some of the problems associated with ensuring that
a sufficient number of simulations is used should not be underestimated.


12.5.6. Unobserved Heterogeneity Example 


We return to the example of Section 12.4.5. Then $y_i \sim N[\theta + u_i, 1]$, where $u_i$ has
density $g(u_i)$ given in (12.24). Since $E[y_i - \theta - u_i] = 0$, we can estimate $\theta$ by the
method of moments estimator that solves

$$\frac{1}{N} \sum_{i=1}^{N} (y_i - \theta - E[u_i]) = 0, \tag{12.39}$$

yielding $\hat{\theta}_{MM} = \bar{y} - E[u]$. Suppose that $E[u]$ is unknown. Then we can instead use
the MSM estimator $\hat{\theta}_{MSM}$ that solves

$$\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \theta - \frac{1}{S} \sum_{s=1}^{S} u_i^s \right) = 0, \tag{12.40}$$

where $u_i^s$ are iid random draws from the extreme value distribution.

The estimating equation (12.40) can be solved, yielding

$$\hat{\theta}_{MSM} = \bar{y} - \bar{u}, \tag{12.41}$$

where $\bar{u} = (NS)^{-1} \sum_i \sum_s u_i^s$ is an average over both $N$ and $S$. More generally,
however, an iterative method may be needed to compute the MSM estimator.

The variance of $\hat{\theta}_{MSM}$ is easily obtained. By construction the simulated draws of $u$
are independent of each other and of the original data $y$, so that $V[\hat{\theta}_{MSM}] = V[\bar{y}] +
V[\bar{u}]$. Now $V[\bar{y}] = (\sigma_u^2 + 1)/N$. Since $\bar{u}$ is the average of $NS$ draws of $u$, $V[\bar{u}] =
\sigma_u^2/NS$, and it follows that

$$V[\hat{\theta}_{MSM}] = V[\bar{y}] + V[\bar{u}] = \frac{\sigma_u^2 + 1}{N} + \frac{\sigma_u^2}{NS}. \tag{12.42}$$

This can be consistently estimated using $\hat{\sigma}_u^2 = (NS)^{-1} \sum_{i=1}^{N} \sum_{s=1}^{S} (u_i^s - \bar{u})^2$.

We consider a sample $\{y_1, \ldots, y_{100}\}$ of size $N = 100$ generated from the model
of (12.23) and (12.24) with $\theta = 1$. Table 12.3 gives the MSM estimator as the number of
draws $S \to \infty$. As the number of simulations $S$ increases the MSM estimator approaches
the method of moments estimate, and the standard error falls.


Table 12.3. Method of Simulated Moments Estimation: Example

Number of Simulations          S=1        S=10       S=100      S=1,000    S=∞ (MM)

MSM estimate $\hat{\theta}$    1.0073     1.1096     1.2012     1.1887     1.1879
Standard error                 (.2471)    (.1657)    (.1681)    (.1676)    (.1684)
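The MSM estimator in this example has a closed form, so it can be sketched in a few lines. The code below generates its own data, so the numbers will differ from Table 12.3. The extreme value draws use inversion of the cdf $\exp(-e^{-u})$, and the value $\pi^2/6$ for the variance of this distribution is a standard result used only in the final check. Function names and the seed are assumptions.

```python
import math
import random


def gumbel_draw(rng):
    """Draw from g(u) = exp(-u) exp(-exp(-u)) by inverting its cdf exp(-exp(-u))."""
    p = rng.random()
    while p <= 0.0:
        p = rng.random()
    return -math.log(-math.log(p))


def msm_variance(sigma2_u, N, S):
    """Variance formula (sigma_u^2 + 1)/N + sigma_u^2/(N S)."""
    return (sigma2_u + 1.0) / N + sigma2_u / (N * S)


def msm_estimate(y, S, rng):
    """Return the MSM estimate y_bar - u_bar and its estimated standard error."""
    N = len(y)
    draws = [gumbel_draw(rng) for _ in range(N * S)]
    y_bar = sum(y) / N
    u_bar = sum(draws) / (N * S)
    sigma2_u = sum((u - u_bar) ** 2 for u in draws) / (N * S)
    return y_bar - u_bar, math.sqrt(msm_variance(sigma2_u, N, S))


rng = random.Random(2005)  # arbitrary seed
N, theta_true = 100, 1.0
y = [theta_true + gumbel_draw(rng) + rng.gauss(0.0, 1.0) for _ in range(N)]

results = {S: msm_estimate(y, S, rng) for S in (1, 10, 100, 1_000)}
for S, (estimate, std_error) in results.items():
    print(S, round(estimate, 4), round(std_error, 4))

# Theoretical variance inflation from using a single draw rather than S -> infinity.
sigma2_u = math.pi ** 2 / 6.0
print(msm_variance(sigma2_u, N, 1) / ((sigma2_u + 1.0) / N))
```

With a single draw per observation the theoretical variance is about 1.6 times the method of moments variance in this design, and the penalty is already negligible by S = 100.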


12.6. Indirect Inference 


In this section we outline another simulation-based approach to model estimation
that is sometimes used when one wants to use or estimate a model that is relatively
simple to estimate, even when the underlying dgp is thought to be more complex and
harder to estimate. There are several variants and interpretations of the approach; see
Gouriéroux, Monfort, and Renault (1993), Smith (1993), and Gallant and Tauchen
(1996). The approach has also been called the moment matching approach. Our
exposition essentially follows the first of the aforementioned references.

Suppose that the parametrically specified dgp is denoted by the pdf $f(y; \theta)$, $\theta \in
R^q$, whose parameters are relatively difficult to estimate. Suppose that we can specify
an auxiliary model with the dgp $f^a(y; \beta)$, $\beta \in R^r$, which is easier to estimate by the
quasi- (sometimes also called "pseudo-") maximum likelihood method. For reasons of
identification that are further discussed in the following, we assume that the dimension
of $\beta$ is not smaller than the dimension of $\theta$, that is, $r \geq q$. For example, the auxiliary
model may be an approximation to the exact likelihood, or it may be an exact likelihood
of an approximate model. For a given sample, let $\hat{\beta}$ denote the QML estimates.
Then, by the results covered in Section 5.7, we know that $\hat{\beta}$ is in general an inconsistent
estimator of $\theta$, and under some regularity conditions it converges in probability
to a value called the pseudo-true value, which is a function of $\theta$. The function that
connects the parameters of the auxiliary model to those of the dgp is called the binding
function, denoted as $h(\theta)$. The analytical form of this function may or may not be
known. Therefore, it may not always be possible to obtain $\theta = h^{-1}(\beta)$ or $\hat{\theta} = h^{-1}(\hat{\beta})$.

The method of indirect inference can be used to obtain an improved QML estimator
with a smaller asymptotic bias than $\hat{\beta}$. The idea is to use the model under $f(y;\theta)$ to
generate by simulation pseudo-observations $y^{(s)}$ and to use the auxiliary model under
$f^a(y^{(s)};\beta)$ to estimate $\hat{\beta}^{(s)}$, where s refers to the sth simulation. The indirect estimator
is defined by the solution of

$$\hat{\theta} = \arg\min_{\theta}\, (\hat{\beta}^{(s)} - \hat{\beta})'\,\Omega\,(\hat{\beta}^{(s)} - \hat{\beta}), \tag{12.43}$$

where $\Omega$ is a given symmetric positive definite matrix. This estimator is similar to the
minimum distance estimator considered in Section 6.7. That is, we sequentially generate
pseudo-observations and estimate the parameters of the auxiliary model based on
the pseudo-observations. The iterations continue until the quadratic form in (12.43) is
minimized. A very important point is that the seed that generates the pseudo-random
observations $y^{(s)}$ is kept unchanged, so that variations in the pseudo-observations
across simulations are due to the variation in $\hat{\beta}^{(s)}$.

Before further discussion, we consider a simple but specific example involving a
nonlinear dgp and a linear auxiliary model. The motivation is that the auxiliary model
should be easy to estimate, and the dgp should be easy to simulate.

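Let the dgp be as follows:

$$y_i = \exp(x_i'\gamma) + u_i, \qquad u_i \sim N[0, \sigma_u^2]. \tag{12.44}$$

Let the auxiliary model be the following:

$$y_i = x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \sim N[0, \sigma_\varepsilon^2]. \tag{12.45}$$

Note the following interpretations:

$$\frac{\partial E[y|x]}{\partial x} = \beta \quad \text{(under the auxiliary model)},$$

$$\frac{\partial \ln E[y|x]}{\partial x} = \frac{\partial E[y|x]}{\partial x}\,\frac{1}{E[y|x]} = \gamma \quad \text{(under the dgp)}.$$

Therefore, the binding function is $\gamma E[y|x] = \beta$, or $\gamma = (E[y|x])^{-1}\beta$. Note that
dim[$\beta$] equals dim[$\gamma$].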


Given the data $(x_i, y_i, i = 1,\ldots,N)$ and the least-squares estimator $\hat{\beta}$, and given
an N-dimensional pseudo-random draw, denoted $u^{(1)}$, we generate $y_i^{(1)}$ $(i = 1,\ldots,N)$
using

$$y_i^{(1)} = \exp(x_i'\hat{\beta}) + u_i^{(1)}$$

and obtain a revised estimator $\hat{\beta}^{(1)} = (\sum_i x_i x_i')^{-1}\sum_i x_i y_i^{(1)}$, which in turn is used to
generate another set of pseudo-observations. The entire simulation cycle is repeated,
holding $u^{(1)}$ fixed, until $(\hat{\beta}^{(s)} - \hat{\beta})'\Omega(\hat{\beta}^{(s)} - \hat{\beta})$ approaches a constant value to desired
accuracy. In the present case it is reasonable to set $\Omega$ equal to either the identity matrix
or $X'X$, the latter choice implying that prediction from the auxiliary model is a
modeling objective. The resulting estimate of $\gamma$ is the indirect estimator.

In other applications dim($\beta$) will exceed dim($\theta$), so a unique value of $\theta$ may not be
available. Indeed, in the absence of an analytical binding function, we cannot recover
$\theta$, even if the two dimensions are the same. Then one settles for the best indirect
estimates of the auxiliary model parameters.

To see the connection between the indirect estimator and moment matching, set
$\Omega = X'X$; then $(\hat{\beta}^{(s)} - \hat{\beta})'X'X(\hat{\beta}^{(s)} - \hat{\beta}) = (X\hat{\beta}^{(s)} - X\hat{\beta})'(X\hat{\beta}^{(s)} - X\hat{\beta})$, which indicates
that the indirect estimator is "matching" the first moment of the distribution. If one
also wants to match the second moment, the vector $\beta$ can be augmented by additional
parameters, such as the variance parameter. Thus one can match several moments if so
desired.

Under regularity conditions the indirect estimator is consistent and asymptotically 
normal. The reader is referred to the previously cited works for additional detail. 
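
The matching idea behind (12.43) can be illustrated with a small sketch for the exponential dgp and linear auxiliary model above. It rests on several assumptions that are not in the text: a single positive regressor with no intercept, a known error standard deviation `sigma`, and a bisection search on the gap between the simulated and the data-based auxiliary estimates in place of the iterative cycle described above. With one auxiliary and one structural parameter the quadratic form can be driven to zero, so the weighting matrix plays no role here. All function names are hypothetical.

```python
import math
import random


def ols_slope(x, y):
    """Least-squares slope in the no-intercept auxiliary model y = x*beta + e."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def simulate_y(x, gamma, u):
    """Pseudo-observations from the dgp y = exp(x*gamma) + u, with u held fixed."""
    return [math.exp(xi * gamma) + ui for xi, ui in zip(x, u)]


def indirect_estimate(x, y, u_sim, lo=-5.0, hi=5.0, tol=1e-10):
    """Find gamma whose simulated auxiliary estimate matches the data estimate."""
    beta_hat = ols_slope(x, y)

    def gap(gamma):
        return ols_slope(x, simulate_y(x, gamma, u_sim)) - beta_hat

    # gap() is increasing in gamma because every x is positive in this sketch,
    # so a simple bisection finds the root.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


rng = random.Random(12345)
N, gamma_true, sigma = 500, 0.5, 0.5          # illustrative values only
x = [rng.uniform(0.1, 2.0) for _ in range(N)]
y = [math.exp(xi * gamma_true) + rng.gauss(0.0, sigma) for xi in x]

# The simulation errors are drawn once and reused for every trial gamma.
u_sim = [rng.gauss(0.0, sigma) for _ in range(N)]

gamma_hat = indirect_estimate(x, y, u_sim)
print("auxiliary OLS estimate:", round(ols_slope(x, y), 4))
print("indirect estimate of gamma:", round(gamma_hat, 4))
```

Because `u_sim` is fixed, the simulated auxiliary estimate is a smooth function of the trial value of gamma, which is what makes the search well behaved.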

12.7. Simulators


As in Section 12.3.2 we consider computation of

$$I = E[h(x)] = \int h(x)g(x)\,dx, \tag{12.46}$$

where for simplicity x is often a scalar. As in Section 12.3, x is being used here to
denote the variable being integrated out, whereas in the application sections u denotes
the variable being integrated out and x denotes the regressors.

A simulator is a method to compute I. There are many ways to do so, aside from
direct Monte Carlo integration given in (12.14). Ideally, simulators should be unbiased, 
because many of the results in Sections 12.4 and 12.5 assume an unbiased simulator, 
and smooth so that standard iterative gradient methods can be used. Even then the 
computing time for estimation of empirically interesting models can be a formidable 
obstacle. We present a few of the many clever procedures that have been developed 
to speed up simulation by reducing, for any given number of simulation draws, the 
simulation variance relative to crude methods such as direct Monte Carlo integration. 
A more complete survey is given in Geweke and Keane (2001). 


12.7.1. Frequency Simulator 


We begin with an example, the frequency simulator, that can be used for some discrete 
models. This highlights well some of the complications that can arise in simulation. 

Suppose the function h(x) is an indicator function that takes value 1 if $x \in A$ and 0
otherwise. Then we wish to compute

$$I = \int 1(x \in A)\,g(x)\,dx.$$

Direct Monte Carlo integration yields the estimate

$$\hat{I}_{FREQ} = \frac{1}{S}\sum_{s=1}^{S} 1(x^s \in A),$$

where $x^s$, $s = 1,\ldots,S$, are S draws from g(x). This is called the frequency simulator
as it estimates I by the relative frequency with which the S draws of $x^s$ fall in A.

A leading potential application, and one that has motivated much of the econometrics
literature on simulation methods, is the multinomial discrete choice model introduced
in Section 12.2.2. For a three-alternative model, the probability $p_1$ of choosing the first
alternative is given by (12.3), an integral over the positive orthant of a bivariate normal
distribution. The frequency simulator $\hat{p}_1$ is then the proportion of draws $(u_1^s, u_2^s)$ from
the bivariate normal with $u_1^s > 0$ and $u_2^s > 0$.

The frequency simulator has several limitations. First, it is neither differentiable nor
continuous in parameters $\theta$, which appear in $1(x \in A)$ and/or g(x). A small change
in $\theta$ can leave the number of draws falling in the positive orthant unchanged, so the
simulator is a step function of $\theta$. For this reason
McFadden (1989) and Pakes and Pollard (1989) presented a more general asymptotic
theory that covers such nonsmooth simulators. In practice, however, it is best to use
alternative smooth simulators that are differentiable in parameters as this permits
computation using the usual gradient methods.

Second, this simulator is very inefficient if only a small fraction of the draws of x fall
in A. For example, for a discrete choice model with $p_1 = 0.001$, even with S = 10,000
draws the estimate $\hat{p}_1$ will be very noisy. Similar problems arise more generally in direct Monte
Carlo evaluation of (12.46) with continuous h(x) if the probability of drawing x is low
in regions where h(x) is relatively large.

Third, this simulator may have problems at the boundary and give an estimate $\hat{I} = 0$
or $\hat{I} = 1$ even if the model imposes $0 < I < 1$ and this condition is necessary for
model estimation.
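
A minimal sketch of the frequency simulator for a positive-orthant probability follows. The bivariate normal is assumed to have unit variances and correlation `rho`, an illustrative parameterization rather than the one in (12.3), and the function name is hypothetical. For this case the exact probability is $1/4 + \arcsin(\rho)/(2\pi)$, which gives a check on the estimate. The last comparison illustrates the first limitation: with the underlying draws held fixed, a tiny change in the parameter will typically leave the estimate exactly unchanged.

```python
import math
import random


def frequency_simulator(rho, S, seed):
    """Share of S bivariate normal draws (unit variances, correlation rho)
    that fall in the positive orthant u1 > 0, u2 > 0."""
    rng = random.Random(seed)          # same seed => same underlying draws
    hits = 0
    for _ in range(S):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        u1 = z1
        u2 = rho * z1 + math.sqrt(1.0 - rho * rho) * z2
        if u1 > 0.0 and u2 > 0.0:
            hits += 1
    return hits / S


exact = 0.25 + math.asin(0.5) / (2.0 * math.pi)     # equals 1/3 for rho = 0.5
p_hat = frequency_simulator(0.5, 20000, seed=1)
p_hat_shifted = frequency_simulator(0.5 + 1e-9, 20000, seed=1)

print("frequency estimate:", p_hat, " exact:", round(exact, 4))
print("unchanged after a tiny change in rho:", p_hat == p_hat_shifted)
```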


12.7.2. Importance Sampling 


The importance sampling simulator reexpresses the integral (12.46) as 


$$I = \int \left(\frac{h(x)g(x)}{p(x)}\right) p(x)\,dx = \int w(x)p(x)\,dx, \tag{12.47}$$


where p(x) is a density function chosen so that (a) it is easy to draw from p(x), (b) 
p(x) has the same support as the original domain of integration, and (c) w(x) = 
h(x)g(x)/p(x) is easy to evaluate, is bounded, and has finite variance. We then use 
the direct Monte Carlo integral estimate based on (12.47) rather than (12.46), 


$$\hat{I}_{IS} = \frac{1}{S}\sum_{s=1}^{S} w(x^s), \tag{12.48}$$

where $x^s$, $s = 1,\ldots,S$, are draws from p(x) rather than g(x). The term importance
sampling is used because w(x) determines the weight or "importance" of different
points in the sample space. Importance sampling has been employed in the Bayesian
simulation literature for many years and was introduced into Bayesian econometrics by
Kloek and van Dijk (1978) as a way of evaluating posterior distributions. This material
is further discussed in Section 13.4.

The importance sampler $\hat{I}_{IS}$ has variance $S^{-1}V_p[w(x)]$, given independent draws
from p(x). This variance is clearly minimized if w(x) is a constant over the entire range
of integration, since then $V_p[w(x)]$ is zero. This is done by setting $w(x) = E_g[h(x)]$,
as then $p(x) = h(x)g(x)/E_g[h(x)]$ is a density that integrates to 1. Unfortunately, this
theoretically ideal importance sampling estimate is not practicable, as $E_g[h(x)]$ is unknown.
However, it does indicate the potential gains to importance sampling, especially
if p(x) is chosen so that w(x) is fairly flat.

Even if importance sampling leads to an increased variance, which can occur in 
practice, it does have other attractions. It produces a smooth sampler if w(x) is smooth 
in the parameters to be estimated. Moreover, it is useful if draws from g(x) are difficult, 
as can often be the case if x is a vector of correlated random variables. 
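
The following sketch applies (12.48) to a rare-event probability, where the frequency simulator is very noisy. The setup is an illustrative assumption: g(x) is the standard normal density, h(x) = 1(x > 4), and the importance density p(x) is N[4, 1], so that w(x) = 1(x > 4) exp(-4x + 8), which is bounded. The function name is hypothetical.

```python
import math
import random


def importance_sampler(S, rng, c=4.0):
    """Estimate I = Pr[x > c] for x ~ N[0, 1] using draws from p(x) = N[c, 1]."""
    total = 0.0
    for _ in range(S):
        x = rng.gauss(c, 1.0)                         # draw from p(x), not g(x)
        if x > c:                                     # h(x) = 1(x > c)
            total += math.exp(-c * x + 0.5 * c * c)   # g(x)/p(x) for these normals
    return total / S


def frequency_sampler(S, rng, c=4.0):
    """Direct Monte Carlo: share of N[0, 1] draws exceeding c."""
    return sum(rng.gauss(0.0, 1.0) > c for _ in range(S)) / S


rng = random.Random(7)
S = 10000
exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))         # 1 - Phi(4)
is_estimate = importance_sampler(S, rng)
freq_estimate = frequency_sampler(S, rng)

print("exact:              ", exact)
print("importance sampling:", is_estimate)
print("frequency simulator:", freq_estimate)
```

With the same number of draws the frequency estimate is usually exactly zero here, whereas the importance sampling estimate is typically within a few percent of the exact value.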

For the multinomial probit discrete choice model a popular importance sampler is
the GHK simulator, due to Geweke (1992), Hajivassiliou and McFadden (1994), and
Keane (1994). This recursively truncates the multivariate normal pdf so that draws
are restricted to the positive orthant. Advantages of this simulator compared to the
frequency simulator are that it is smooth, requires many fewer draws for alternatives
with low probability of being chosen, and is unlikely to have boundary problems.


12.7.3. Variance Reduction by Antithetic Acceleration 


The preceding methods assume independent draws, using methods to be detailed in 
Section 12.8, from an appropriate distribution such as g(x) or, if importance sampling 
is used, from p(x). 

Variance reduction methods instead use dependent draws as these can reduce the 
variance of a simulator. A leading example is antithetic sampling that uses nega- 
tively correlated draws. Ripley (1987, pp. 129-132), Geweke (1988), and Hajivassiliou 
(2000) provide a discussion of this technique and Geweke (1995) surveys this and sev- 
eral other variance reduction techniques. 

Suppose we wish to evaluate the integral I in (12.46), where x is assumed to have
zero mean and symmetric density g(x). The direct Monte Carlo integral, based on 2S
simulated iid draws from g(x), is

$$\hat{h}_{2S}(x) = \frac{1}{2S}\sum_{s=1}^{2S} h(x^s)$$

and, given independence of the 2S draws, has variance

$$V[\hat{h}_{2S}(x)] = \frac{1}{2S}V[h(x)].$$

Antithetic sampling uses an alternative estimate based on only S iid draws,

$$\hat{h}_{A,S}(x) = \frac{1}{S}\sum_{s=1}^{S}\frac{1}{2}\left(h(x^s) + h(-x^s)\right), \tag{12.49}$$

which is an average of h(x) evaluated at $x^s$ and $-x^s$. The pair $(x^s, -x^s)$ is said to be
an antithetic pair and yields an unbiased estimate of I since we assume x is symmetrically
distributed with zero mean. If the mean is instead $\mu$ then $(x^s, 2\mu - x^s)$ is an
antithetic pair. Given S independent draws of $x^s$ the variance of $\hat{h}_{A,S}(x)$ is

$$V[\hat{h}_{A,S}(x)] = \frac{1}{4S}\left(V[h(x^s)] + 2\,\mathrm{Cov}[h(x^s), h(-x^s)] + V[h(-x^s)]\right)
= \frac{1}{2S}\left(V[h(x)] + \mathrm{Cov}[h(x), h(-x)]\right).$$


Antithetic sampling will therefore be more efficient than regular iid sampling if the
covariance term is negative, since then the variance of $\hat{h}_{A,S}(x)$ is smaller than that of
$\hat{h}_{2S}(x)$. By switching the sign of the draw, and then reusing the draw, an attempt is
made to induce negative correlation in the simulator. Negative correlation is assured
when the function is linear, and also if the nonlinearity is not too severe. However, in
general, one cannot be certain that efficiency gains will be realized. For example, if
$h(\cdot)$ is symmetric about zero then $\mathrm{Cov}[h(x), h(-x)] = V[h(x)]$, so the variance of
$\hat{h}_{A,S}(x)$ is $V[h(x)]/S$, twice that of $\hat{h}_{2S}(x)$.
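
The variance comparison above can be checked numerically. The sketch below, with hypothetical helper names and illustrative choices of h, computes the sample variance of the pair average (h(x1) + h(x2))/2 when x2 = -x1 (antithetic) and when x2 is an independent draw. The function exp(0.5x) is monotonic, so a gain is expected; the function x squared is symmetric about zero, so antithetic sampling is expected to do worse.

```python
import math
import random


def sample_variance(values):
    """Unbiased sample variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def pair_averages(h, S, rng, antithetic):
    """S values of (h(x1) + h(x2))/2 with x1 ~ N[0, 1]; x2 = -x1 if antithetic,
    otherwise x2 is an independent N[0, 1] draw."""
    out = []
    for _ in range(S):
        x1 = rng.gauss(0.0, 1.0)
        x2 = -x1 if antithetic else rng.gauss(0.0, 1.0)
        out.append(0.5 * (h(x1) + h(x2)))
    return out


def monotone(x):
    return math.exp(0.5 * x)


def symmetric(x):
    return x * x


rng = random.Random(2024)
S = 20000
var_iid_mono = sample_variance(pair_averages(monotone, S, rng, antithetic=False))
var_anti_mono = sample_variance(pair_averages(monotone, S, rng, antithetic=True))
var_iid_sym = sample_variance(pair_averages(symmetric, S, rng, antithetic=False))
var_anti_sym = sample_variance(pair_averages(symmetric, S, rng, antithetic=True))

print("h(x) = exp(0.5x): iid", round(var_iid_mono, 4), " antithetic", round(var_anti_mono, 4))
print("h(x) = x^2:       iid", round(var_iid_sym, 4), " antithetic", round(var_anti_sym, 4))
```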

Antithetic sampling can be extended to asymmetric density g(x). Suppose x can be
drawn using the inverse transformation method given later in Section 12.8.2. Then one
can draw u, say, from the uniform [0, 1], generate the antithetic transform (1 − u), and
then use the inverse transformation method to draw from the distribution of choice, so
$x_1 = G^{-1}(u)$ and $x_2 = G^{-1}(1 - u)$, where $G(\cdot)$ is the known cdf of x. Then $(x_1, x_2)$
form an antithetic pair and variance reduction occurs if

$$\mathrm{Cov}[h(G^{-1}(u)), h(G^{-1}(1 - u))] = \mathrm{Cov}[f(u), f(1 - u)] < 0,$$

where f(u) is the composite function $h(G^{-1}(u))$. If $f(\cdot)$ is a monotonic function then
the variance is reduced (Robert and Casella, 1999, p. 112). However, this property
of the function may be difficult to verify. Further, the argument applies to the inverse
transformation approach only, whereas in practice other methods are used in
pseudo-random number generation (see Section 12.8). Therefore it is difficult to verify
in advance whether the conditions for efficiency gains are attainable in a specific
application.

Although the dramatic gains in efficiency possible in some special cases may not 
materialize in more complex settings, worthwhile efficiency gains are realized in 
many cases. Antithetic sampling can also be used to accelerate importance sampling 
(Danielsson and Richard, 1993). 

Antithetic sampling extends to multivariate draws. Consider bivariate draws of
(x, y), where the density is symmetric about (0, 0). In this case sign reversal is done
first element by element and then for the pair. Thus the antithetic quadruple consists of
$((x^s, y^s), (-x^s, y^s), (x^s, -y^s), (-x^s, -y^s))$. For an m-dimensional draw the same
idea is repeated for all tuples.


12.7.4. Computation Using Quasi-Random Sequences 


A second method of variance reduction involves replacing pseudo-random numbers by 
quasi-random numbers, which are systematic simulation draws designed to provide 
better coverage of the sample space. A potential limitation of the approach is that 
randomness is required to apply the laws of large numbers and central limit theorems 
that justify the simulation-based approach. 

Quasi-Monte Carlo methods use nonrandom points within the domain of integration 
instead of using S pseudo-random points. A leading example is Halton sequences, 
summarized in Press et al. (1993) and introduced into the econometrics literature by 
Bhat (2001) and Train (2003). 

Halton sequences have two desirable properties. First, they are designed to give
fairly even coverage over the domain of the sampling distribution. With more evenly
spread draws for each observation, the simulated probabilities vary less over observations,
relative to those calculated with random draws. This is similar to deterministic
evaluation of an integral over a specified grid. Second, with Halton sequences, the
draws for one observation tend to fill in the spaces left empty by the previous observations.
The simulated probabilities are, therefore, negatively correlated over observations.
As in the case of antithetic variates, this negative correlation reduces the variance
of the simulated function. Under suitable regularity conditions it can be shown
that the integration error using quasi-random sequences is of order $N^{-1}$, compared
to pseudo-random sequences where the convergence rate is $N^{-1/2}$ (Bhat, 2001).

Halton sequences are best described by example. Suppose that the function to be
simulated depends on one random variable. The starting point is a prime number. The
Halton sequence based on the prime number 2 is constructed as follows. Divide the unit
interval (0, 1) into two parts. The dividing point 1/2 becomes the first element of the
Halton sequence. Next divide each part into two more parts. The dividing points, 1/4
and 3/4, become the next two elements of the sequence. Divide each of the four parts
into two parts each, adding the new step 1/8 first to zero and then to each earlier element
in turn, and continue to obtain the sequence {1/2, 1/4, 3/4, 1/8, 5/8, 3/8, 7/8, ...}.
Similarly, the sequence based on the prime number 3 is {1/3, 2/3, 1/9, 4/9, 7/9, ...}.
Halton sequences on nonprime numbers are not unique because the Halton sequence
for a nonprime number divides the unit space in the same way as each of the prime
numbers that constitute the nonprime.
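
A Halton sequence can be generated with the standard radical-inverse construction: write the index in the chosen base and reflect its digits about the decimal point. The sketch below uses hypothetical names and illustrative sizes. For base 2 it gives 1/2, 1/4, 3/4, 1/8, 5/8, 3/8, 7/8, and for base 3 it gives 1/3, 2/3, 1/9, 4/9, 7/9. The final lines discard the first 20 elements and map the remainder to standard normal draws through the inverse cdf.

```python
from statistics import NormalDist


def halton(index, base):
    """Element number `index` (index >= 1) of the Halton sequence for a prime
    base, computed as the radical inverse of `index` in that base."""
    value = 0.0
    fraction = 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)   # peel off the lowest digit
        value += digit * fraction
        fraction /= base
    return value


print([halton(i, 2) for i in range(1, 8)])
print([round(halton(i, 3), 4) for i in range(1, 6)])

# Discard the first 20 elements, then transform to standard normal draws.
S = 1000
uniform_draws = [halton(i, 2) for i in range(21, 21 + S)]
normal_draws = [NormalDist().inv_cdf(u) for u in uniform_draws]
print("mean of Halton-based normal draws:", round(sum(normal_draws) / S, 4))
```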

The length of each sequence is determined by the number of observations N and
the number of simulation draws S. One discards the first few (say 20) elements of the
sequence as the early elements have a tendency to be correlated over Halton sequences
with different primes (see Train, 2003, for an example). Consequently, one could begin
by generating Halton sequences of length N × S + 20 and discard the first 20 elements
of each sequence. For each element of each sequence, calculate the inverse of the
cumulative normal distribution. The resulting values are the Halton draws from the
sampling distribution.

One major advantage of quasi-random number draws is that the draws are designed 
to cover the sample space of random numbers in a more uniform fashion than in 
the case of pseudo-random numbers. This can be seen visually in Figure 12.1. In this 
figure, Panel 2 shows a draw from a bivariate normal distribution constructed using 
a Halton sequence. The remaining three panels show pseudo-random number draws 
from the same distribution. The more even coverage of the sample space is evident in 
the former case. 

For a more thorough discussion and examples of simulation-based estimation that use
Halton draws and impressive evidence of the relative efficiency of the approach in one
or more dimensions, see Train (2003, Chapter 9). The method works very well for the
multinomial logit model with normally distributed random parameters (Section 15.7).


12.8. Methods of Drawing Random Variates 


The preceding simulators require draws of random variates. In this section we summa- 
rize methods to take such draws from a density, denoted g(x) or p(x) in Section 12.7 
and denoted f(x) in this section. Usually it is sufficient to obtain draws from the 
uniform or the standard normal (which is possible in most popular software) since 
these can form the basis for making draws from distributions other than the uniform 
or normal. 

Figure 12.1: Halton sequence draws (panel 2) compared to pseudo-random draws.

If the draws are to be used for simulation-based estimation then all draws from the
uniform or standard normal should be made before any estimation, to prevent "chatter,"
whereby iterative methods fail to converge owing to noise created by new draws at
each iteration. For example, if $x \sim N[\mu, \sigma^2]$ and estimates of $\mu$ and $\sigma$ change over
iterations, then we make NS initial draws of $z \sim N[0, 1]$ and then over iterations
recompute $x = \mu + \sigma z$ using the original draws of z.
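
A minimal sketch of this practice follows, with illustrative sizes and a hypothetical helper name. The standard normal draws are made once, and every trial value of the parameters reuses them, so that changes in x across iterations reflect only changes in the parameters.

```python
import random

N, S = 5, 3                                   # illustrative sizes only
rng = random.Random(42)
z = [rng.gauss(0.0, 1.0) for _ in range(N * S)]   # drawn once, before estimation


def draws(mu, sigma):
    """Recompute x = mu + sigma*z from the stored standard normal draws."""
    return [mu + sigma * zi for zi in z]


x_a = draws(1.0, 2.0)       # draws at one trial value of (mu, sigma)
x_b = draws(1.1, 2.0)       # draws after mu is updated; z is unchanged

# Each element moves by exactly the change in mu, with no simulation noise.
print([round(b - a, 10) for a, b in zip(x_a, x_b)])
```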

This section provides a basic discussion of some standard methods for gener- 
ating random variates. For more advanced or extensive treatments, there are many 
good monographs and surveys, including those by Bratley, Fox, and Schrage (1983),
Dagpunar (1988), Devroye (1986), and Ripley (1987). 

Before presenting the methods, note that the term random number generation is 
an oxymoron. A more accurate description is given by the term pseudo-random 
numbers. The essential characteristic of these generators is that they use determin- 
istic devices to produce long chains of numbers that mimic the properties of the real- 
izations from some target distribution. The specific target distribution will depend on 
the context, but for the purposes of this book uniform, normal, exponential, gamma, 
logistic, and Poisson distributions are standard. The chain process is started up by sup- 
plying a seed. After some finite but large number of values have been generated the 
cycle of numbers repeats itself. That is, the computer algorithms will generate exactly 
the same numbers beginning with a given seed. Good random number generators are 
those that generate a long chain of numbers without recycling and without any built-in 
dependence. The key consideration in choosing generators is whether the generated
distribution closely mimics the properties of the target distribution at a reasonable
computational cost.


12.8.1. Pseudo-Random Uniform Number Generators 


Pseudo-random uniform numbers are constructed using a deterministic sequence 
that mimics the statistical properties of a sequence of uniform random numbers. A 
good generator has a long period, has a distribution close to uniform, and produces
independent draws. It is important to have a good generator, as pseudo-random numbers
from virtually any distribution can then be obtained by transforming uniform
pseudo-random numbers (Bratley et al., 1983, p. 24).

A standard generator begins with the equation 


$$X_j = (kX_{j-1} + c) \bmod m,$$

where the modulus operator a mod b forms the remainder when a is divided by b. This
produces a sequence of integers between 0 and m − 1, and the uniform random variable
is then obtained as $R_j = X_j/m$ (Ripley, 1987, p. 20). A value for $X_0$, referred to as
the seed, is needed to initiate the generator. The uniform random sequences generated
are deterministic, which permits replication as the same numbers should be drawn
if analysis is repeated with the same value of the seed. The periodicity of the cycle
depends on $X_0$, k, and c. If computation is done using 32-bit integer arithmetic the
maximum periodicity is approximately $2^{31} \approx 2.1 \times 10^{9}$. However, it is easy to choose
poor values of $X_0$, k, and c so that the periodicity is much lower than this. Books such
as that by Press et al. (1993) should be consulted for potential pitfalls.
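
The recursion can be sketched in a few lines. The parameter values k = 16807, c = 0, and m = 2^31 − 1 are an assumption made for illustration (they are the widely used "minimal standard" multiplicative generator) and are not taken from the text; the function name is hypothetical. With c = 0 the seed must be nonzero.

```python
def lcg(seed, n, k=16807, c=0, m=2**31 - 1):
    """Return n uniform pseudo-random numbers R_j = X_j / m from the
    recursion X_j = (k * X_{j-1} + c) mod m, started at X_0 = seed."""
    x = seed
    uniforms = []
    for _ in range(n):
        x = (k * x + c) % m
        uniforms.append(x / m)
    return uniforms


first = lcg(seed=1, n=5)
again = lcg(seed=1, n=5)
print(first)
print("same seed reproduces the sequence:", first == again)
```

Python integers do not overflow, so the product k * x is computed exactly; in a language with fixed-width integers the multiplication needs more care.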


12.8.2. Nonuniform Variates 


Random variables from many other distributions, including the normal itself, are usually
based on an initial draw of a uniform random number. Four commonly used methods
are (1) inverse transformation, (2) transformation, (3) accept-reject, and (4) mixing
and compounding.


Inverse Transformation 


Let F(x) denote the cdf of the continuous random variable x, that is,

$$F(x) = \Pr[X \le x].$$

Given a draw of a uniform variate r, 0 < r < 1, the inverse transformation

$$x = F^{-1}(r)$$

gives a unique value of x because F is continuous and monotonically increasing.

For example, the cdf of the unit exponential is $F(x) = 1 - e^{-x}$. Solving $r = 1 - e^{-x}$
yields $x = -\ln(1 - r)$. If we make a draw from uniform [0, 1] and get 0.64, then
$x = -\ln(1 - 0.64) = 1.0217$. Figure 12.2 plots the cdf of X and shows graphically
how this method works. An arbitrary point on the vertical axis at height r is selected
and the corresponding value on the horizontal axis is obtained by completing a rectangle.
This is the inverse transformation.


Figure 12.2: Inverse transformation method for making draws from the unit exponential. A
random uniform draw of 0.64 (so F(x) = 1 − exp(−x) = 0.64) yields x = 1.02. [Plot of the
cdf F(x) (vertical axis) against the random variable x (horizontal axis).]


This method is particularly easy to use if the analytical form of F (-) is given and x 
is a continuous random variable. If there is no closed-form expression available, then 
the method is still often feasible, albeit computationally more costly, as the inverse 
cdfs of standard distributions are often available as functions in programs. 

The method can be extended to discrete random variables with a cdf that is a step
function. For example, if x takes integer values then a uniform draw r = 0.312 leads to
a draw of x = j, where the integer j is such that $F(j - 1) < 0.312 \le F(j)$.
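
Both cases can be sketched briefly. The unit exponential uses the closed-form inverse derived above. The discrete case takes the smallest j with F(j) at least r; the three-point cdf used here is a made-up illustration, and the function names are hypothetical.

```python
import bisect
import math
import random


def draw_unit_exponential(r):
    """Inverse transformation for the unit exponential: x = -ln(1 - r)."""
    return -math.log(1.0 - r)


def draw_discrete(r, cdf):
    """Smallest index j with cdf[j] >= r, where cdf[j] = Pr[X <= j]."""
    return bisect.bisect_left(cdf, r)


print(round(draw_unit_exponential(0.64), 4))

# Hypothetical discrete distribution on {0, 1, 2} with cdf values 0.2, 0.5, 1.0.
cdf = [0.2, 0.5, 1.0]
print(draw_discrete(0.312, cdf))      # F(0) = 0.2 < 0.312 <= F(1) = 0.5

rng = random.Random(3)
sample = [draw_unit_exponential(rng.random()) for _ in range(20000)]
print("sample mean (population mean is 1):", round(sum(sample) / len(sample), 3))
```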

A standard method for generating normal random variates is the Box-Muller
method. This uses the inverse transformation method, applied to the joint distribution
of two independent normal variates rather than to a single variate. Specifically, if
$r_1$ and $r_2$ are iid uniform then $x_1 = \sqrt{-2\ln r_1}\cos(2\pi r_2)$ and $x_2 = \sqrt{-2\ln r_1}\sin(2\pi r_2)$
are iid N[0, 1].
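
A short sketch of the Box-Muller transformation follows, with a hypothetical function name. The only liberty taken is to use 1 − r for the first uniform so that the logarithm is never evaluated at zero.

```python
import math
import random


def box_muller(rng):
    """Return two independent N[0, 1] draws from two uniform draws."""
    r1 = 1.0 - rng.random()        # lies in (0, 1], so log(r1) is finite
    r2 = rng.random()
    radius = math.sqrt(-2.0 * math.log(r1))
    angle = 2.0 * math.pi * r2
    return radius * math.cos(angle), radius * math.sin(angle)


rng = random.Random(11)
pairs = [box_muller(rng) for _ in range(10000)]
values = [v for pair in pairs for v in pair]

mean = sum(values) / len(values)
variance = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
cross = sum(a * b for a, b in pairs) / len(pairs)   # sample E[x1 * x2]

print("mean:", round(mean, 3), " variance:", round(variance, 3),
      " cross-moment:", round(cross, 3))
```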


Transformation 


In some cases a random variable with the desired density can be obtained by suitable
transformation of a random variable whose distribution is easy to draw from. Then
random variates can be obtained by applying this same transformation.

This transformation method is an obvious way to make draws from distributions
based on the normal. Examples include squaring standard normal variates to obtain
random variables with central chi-square distribution, adding squared values of r
independent standard normal variates to yield chi-squared variates with r degrees of
freedom, and computing the ratio of two independent chi-squares, each divided by its
degrees of freedom, to yield F-distributed random variables. Transformation methods
are not restricted to distributions based on the normal.


Accept-Reject Methods


Suppose we want to draw from the density f(x) but this is difficult. However, there is
another density g(x) that covers f(x) in the sense that $f(x) \le kg(x)$ for all x for some
finite constant k. This is depicted in Figure 12.3, where the thick line serves to mimic
the envelope kg(x).

Figure 12.3: Accept-reject method draws from density g(x) where kg(x) envelopes the
desired density f(x). [Plot of the desired density f(x) and the envelope density g(x)
against the random variable x.]

The accept-reject method makes a draw x from g(x), rather than from f(x). The draw
is accepted if

$$r \le \frac{f(x)}{kg(x)},$$


where r is a draw from the uniform distribution. If the condition is not satisfied then 
the draw is rejected and further draws are made until the condition is satisfied. The 
appeal of the method depends on the ease of drawing from g(x) rather than f(x). The 
limitation is that on average a draw will be accepted with probability 1/k, so that many 
draws are needed if k is large. 
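
The sketch below applies the accept-reject rule to an illustrative target that is not from the text: f(x) = 6x(1 − x) on [0, 1], with the uniform density as g(x) and k = 1.5, the maximum of f. The acceptance rate should then be close to 1/k = 2/3. Names are hypothetical.

```python
import random

K = 1.5                     # envelope constant: f(x) <= K * g(x) on [0, 1]


def f(x):
    """Target density, chosen for illustration: f(x) = 6x(1 - x) on [0, 1]."""
    return 6.0 * x * (1.0 - x)


def accept_reject(rng):
    """Return one draw from f and the number of proposals it required."""
    tries = 0
    while True:
        tries += 1
        x = rng.random()            # proposal from g = uniform[0, 1], g(x) = 1
        r = rng.random()            # uniform draw used in the acceptance test
        if r <= f(x) / (K * 1.0):
            return x, tries


rng = random.Random(5)
n = 20000
results = [accept_reject(rng) for _ in range(n)]
draws_f = [x for x, _ in results]
total_tries = sum(t for _, t in results)

mean = sum(draws_f) / n
variance = sum((x - mean) ** 2 for x in draws_f) / (n - 1)
acceptance_rate = n / total_tries

print("acceptance rate:", round(acceptance_rate, 3), " (1/k =", round(1 / K, 3), ")")
print("mean:", round(mean, 3), " variance:", round(variance, 4))
```

The target has mean 0.5 and variance 0.05, which gives a simple check on the accepted draws.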

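To see how this method works, let Y denote the random variable generated by the
accept-reject method, X denote a random variable with density g(x), and U denote a
draw from the uniform. Then Y has cdf

$$\begin{aligned}
\Pr[Y \le y] &= \Pr[X \le y \mid U \le f(X)/kg(X)] \\
&= \frac{\Pr[X \le y,\; U \le f(X)/kg(X)]}{\Pr[U \le f(X)/kg(X)]} \\
&= \frac{\int_{-\infty}^{y}\int_{0}^{f(x)/kg(x)} du\, g(x)\,dx}{\int_{-\infty}^{\infty}\int_{0}^{f(x)/kg(x)} du\, g(x)\,dx} \\
&= \frac{\int_{-\infty}^{y} [f(x)/kg(x)]\, g(x)\,dx}{\int_{-\infty}^{\infty} [f(x)/kg(x)]\, g(x)\,dx} \\
&= \frac{\int_{-\infty}^{y} [f(x)/k]\,dx}{\int_{-\infty}^{\infty} [f(x)/k]\,dx} \\
&= \int_{-\infty}^{y} f(x)\,dx,
\end{aligned}$$

which is the cdf corresponding to the density f(x) as desired.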


Composition


Sometimes the density f(x) can be expressed as being that from a mixture or a compound
distribution, with

$$f(x) = \int g(x|\varepsilon)\,h(\varepsilon)\,d\varepsilon.$$

Then a draw from f(x) can be obtained by first making a draw of $\varepsilon$ from density $h(\varepsilon)$
and then making a draw of x from the conditional density $g(x|\varepsilon)$.

As an example, consider drawing from the negative binomial distribution with mean
$\lambda$ and variance $\lambda(1 + \alpha\lambda)$, where both $\lambda$ and $\alpha$ are given constants. Here we may use
the fact that the negative binomial distribution can be regarded as a Poisson-gamma
mixture (see Chapter 20). First, one draws $\varepsilon$ from a gamma distribution with mean 1
and variance $\alpha$, which can be done by a transformation of the exponential. Second,
one draws from the Poisson distribution with mean $\lambda\varepsilon$, given $\varepsilon$ from the previous step.
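
The two steps can be sketched as follows, with hypothetical function names. Two implementation choices are assumptions rather than the text's method: the gamma draw uses the standard library routine `random.gammavariate` (shape 1/α and scale α give mean 1 and variance α) instead of a transformation of exponentials, and the Poisson draw multiplies uniforms until the product falls below exp(−mean), which is adequate only for moderate means.

```python
import math
import random


def draw_poisson(mean, rng):
    """Poisson draw by multiplying uniforms until the product is <= exp(-mean)."""
    limit = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def draw_negative_binomial(lam, alpha, rng):
    """Composition: eps ~ gamma(mean 1, variance alpha), then x ~ Poisson(lam*eps)."""
    eps = rng.gammavariate(1.0 / alpha, alpha)    # shape 1/alpha, scale alpha
    return draw_poisson(lam * eps, rng)


rng = random.Random(99)
lam, alpha, n = 2.0, 0.5, 20000                   # illustrative values only
sample = [draw_negative_binomial(lam, alpha, rng) for _ in range(n)]

mean = sum(sample) / n
variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
print("sample mean:", round(mean, 3), " target:", lam)
print("sample variance:", round(variance, 3), " target:", lam * (1 + alpha * lam))
```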

If $h(\varepsilon)$ is a discrete distribution with point mass $p_j$ at C points, $j = 1,\ldots,C$, then
the previous integration step is replaced by summation. Thus,

$$f(x) = \sum_{j=1}^{C} p_j\, g(x|\varepsilon = \varepsilon_j).$$

Then, to make S draws from f(x), we draw $Sp_j$ observations each from $g(x|\varepsilon = \varepsilon_j)$,
and "compose" the required sample of S values by pooling the draws.


Some Standard Generators 


The tables in Appendix B describe pseudo-random number generation for several
standard continuous and discrete cases. They are based on the assumption that r, r1, r2, 
... are values of independent uniform [0, 1] random variables R, Ri, R2,.... Note 
that there may exist different methods to generate the corresponding random variable; 
we list only one or two of these methods. 


12.8.3. Multivariate Distributions 


Draws from multivariate distributions are generally much more complicated than 
draws from univariate distributions. For example, methods such as inverse transfor- 
mation and transformation may no longer be applicable. For many multivariate dis- 
tributions the method of mixing or composition can be used, as many multivariate 
distributions are mixture distributions. 

Quite general methods are Gibbs sampling and other Markov chain Monte Carlo 
methods. These are deferred to Section 13.5, as they are extensively applied in 
Bayesian analysis, which uses complicated multivariate distributions. As will be
explained, the draws made using the Gibbs sampler may show some tendency to be
correlated, a fact that will reduce the efficiency of the simulator.

Here we restrict attention to the multivariate normal. Then draws are easily obtained
by transformation of univariate standard normal draws. Specifically, suppose we wish
to make draws from a q-dimensional normal distribution, so $x \sim N[0, \Sigma]$. This can
be done by transformation based on the result that a positive definite $\Sigma$ has Choleski
decomposition

$$\Sigma = LL',$$

where L is a lower triangular matrix. For example, for q = 2 the Choleski decomposition is

$$\begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix}
= \begin{bmatrix} l_{11} & 0 \\ l_{21} & l_{22} \end{bmatrix}
\begin{bmatrix} l_{11} & l_{21} \\ 0 & l_{22} \end{bmatrix},$$

yielding three equations $l_{11}^2 = \sigma_{11}$, $l_{11}l_{21} = \sigma_{12}$, and $l_{21}^2 + l_{22}^2 = \sigma_{22}$ that can be solved
for $l_{11}$, $l_{21}$, and $l_{22}$. Given a q-dimensional vector $\varepsilon$ whose elements are independent
standard normal, so that $\varepsilon \sim N[0, I]$, it is easy to verify that $x = L\varepsilon$, a linear
combination of normals, has distribution $N[0, \Sigma]$. Specifically, $E[L\varepsilon] = 0$, and $V[L\varepsilon] =
E[L\varepsilon\varepsilon'L'] = LL' = \Sigma$. The key to this method is that linear combinations of the
normal are also normally distributed, a result that does not hold for nonnormal distributions.
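
A sketch of this construction in plain Python follows, with hypothetical function names and an illustrative 2 × 2 covariance matrix. The factorization routine implements the standard row-by-row Choleski recursion, which for q = 2 reduces to the three equations above.

```python
import math
import random


def cholesky(sigma):
    """Lower triangular L with L L' = sigma, for a positive definite matrix
    given as a list of rows."""
    q = len(sigma)
    L = [[0.0] * q for _ in range(q)]
    for i in range(q):
        for j in range(i + 1):
            partial = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(sigma[i][i] - partial)
            else:
                L[i][j] = (sigma[i][j] - partial) / L[j][j]
    return L


def draw_mvn(L, rng):
    """One draw x = L * eps, where eps has independent N[0, 1] elements."""
    q = len(L)
    eps = [rng.gauss(0.0, 1.0) for _ in range(q)]
    return [sum(L[i][k] * eps[k] for k in range(i + 1)) for i in range(q)]


sigma = [[4.0, 2.0], [2.0, 3.0]]          # illustrative covariance matrix
L = cholesky(sigma)
print("L =", L)

rng = random.Random(8)
n = 20000
sample = [draw_mvn(L, rng) for _ in range(n)]
var1 = sum(x[0] * x[0] for x in sample) / n       # mean is known to be zero
var2 = sum(x[1] * x[1] for x in sample) / n
cov12 = sum(x[0] * x[1] for x in sample) / n
print("sample moments:", round(var1, 3), round(cov12, 3), round(var2, 3))
```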


12.9. Bibliographic Notes 


Press et al. (1993) provide a good starting point for both quadrature and Monte Carlo integration 
and give further references, including some given elsewhere in this chapter. 

The econometrics literature on simulation-based estimation emphasizes the multinomial pro- 
bit model. The methods have much wider applicability, however, and can be more easily and 
successfully implemented in other models that are less challenging to fit than the multinomial 
probit. Lerman and Manski (1981) used simulated frequencies to estimate choice probabilities 
and found that many draws were needed. McFadden (1989) proposed MSM and demonstrated 
its consistency and asymptotic normality. Pakes and Pollard (1989) provide a quite general 
treatment of the asymptotic theory for both MSM and MSL. The relatively accessible survey of 
Stern (1997) is an excellent place to start. Gouriéroux and Monfort (1996) provide a textbook 
treatment of the basic methods. Many other references are better read in the specific context 
of models that are discussed in later chapters. In particular, Hajivassiliou and Ruud (1994) em- 
phasize truncated normal models including the multinomial probit and Train (2003) considers 
a range of discrete choice models including the random parameters logit. 


Exercises 


12-1 To estimate the integral $I = \int t(x)g(x)\,dx$ by Monte Carlo, the sum $\hat{I} = N^{-1}\sum_i
t(x_i)g(x_i)/p(x_i)$ is used, where $x_i$ are draws from the importance sampling
distribution p(x). Show that plim $\hat{I} = I$.

12-2 For $f(\theta) = |\Sigma|^{-1/2}\left[1 + \frac{1}{\nu}(\theta - \mu)'\Sigma^{-1}(\theta - \mu)\right]^{-(\nu + d)/2}$, consider the d-dimensional
integral $\int_{R^d} f(\theta)\,d\theta$. The integrand is the kernel of a multivariate-t density,
so the correct answer is the inverse of the normalizing constant.

(a) Evaluate this integral as a Monte Carlo average $S^{-1}\sum_{s=1}^{S} f(\theta^s)/h(\theta^s)$,
$\theta^s \sim h(\theta)$, where the importance density $h(\theta)$ is multivariate-t with the
same location and scale as $f(\theta)$, but with a different degrees-of-freedom
parameter.

(b) Explore the stability of this average as you vary the degrees of freedom of
$h(\theta)$. Increase the mismatch between $f(\theta)$ and $h(\theta)$ by changing the location
and scale of $h(\theta)$ and explore further.


12-3 For the MSM estimator in Section 12.5.3 suppose that the simulator is the
frequency simulator.


(a) Show that $V_{y,u}[\hat{m}(\theta_0)] = (1 + 1/S)\,V_y[m(\theta_0)]$.

(b) Hence show that the effect of simulation using the frequency simulator is to
inflate the variance of the method of moments estimator by (1 + (1/S)).

(c) How large is the efficiency loss for the standard errors if S = 10?


12-4 For the example in Section 12.5.6 consider the estimator $\hat{\theta}$ that solves
$\sum_{i} [y_i - S^{-1}\sum_{s} (\theta + u_i^{s})] = 0$. Obtain analytical expressions for this
estimator and its variance.


12-5 (a) Write an algorithm for drawing a pseudo-random sample from a three-
dimensional multivariate normal distribution $N[0, \Sigma]$ with $\sigma_{jj} = 1$, $j = 1, 2, 3$,
and covariances $\sigma_{12} = \sigma_{13} = \sigma_{23} = 0.5$. Draw a sample of 1,000 realizations
and compare the estimated means and variances with those of the dgp.

(b) Repeat part (a) with the trivariate normal being replaced by a Student's 
t-distribution with five degrees of freedom. 


12-6 Write a computing procedure to make draws from a univariate truncated normal
density $TN_{[a,b]}[\mu, \sigma^2]$ using the inverse transform method given in Section
12.8.2. Here $[a, b]$ are lower and upper truncation points. Choose $\mu = 1$, $\sigma^2 = 4$,
and $a = 3$, $b = 4$.


12-7 Consider the standard binary logit regression model (see Section 14.3). 


(a) Write down the log-likelihood function. 

(b) Introduce a random intercept assumption in which the intercept is drawn 
from a suitable distribution with finite mean and variance. What justifica- 
tion can you offer for introducing an unobserved heterogeneity term in this 
way? If the logit model is derived from the random utility model with extreme 
value errors, how does the random intercept affect that interpretation and/or 
derivation? [See Revelt and Train, 1998.] 

(c) Suggest a suitable distributional assumption for the random intercept;
rewrite the likelihood function conditional on unobserved heterogeneity. Next
write down the likelihood function with unobserved heterogeneity integrated
out.

(d) Describe in a step-by-step manner how to use the maximum simulated
likelihood estimation procedure to estimate this model. Explain, with details,
how to calculate the variance matrix of unknown parameters. How would
you decide how many simulations you will use?

(e) Consider the method of simulated moments as an alternative to the MSL
procedure for the random parameter logit. Write down the moment
condition(s) conditional on the unobserved heterogeneity term. Then outline an
MSM estimation procedure for this model.

12-8 Some computing packages allow you to draw both Poisson and gamma pseudo-
random numbers directly. It is also known that the negative binomial distribution
can be derived as a mixture of Poisson and gamma random variables (see
Section 20.4).

(a) Write down a procedure for drawing negative binomial-distributed variables
using the method of mixtures.

(b) Apply your method by drawing a sample of 10,000 on a Poisson-distributed
variable with mean 0.25.

(c) Draw a corresponding sample from a gamma distribution with mean 1 and
variance $\alpha$, with $\alpha$ set to produce negative binomial random variables with
variance 0.3125.


CHAPTER 13


Bayesian Methods 


13.1. Introduction 


This chapter serves as an introduction to Bayesian econometrics. Bayesian regres- 
sion analysis has grown in a spectacular fashion since the publication of books by 
Zellner (1971) and Leamer (1978). Application to routine data analysis has also ex- 
panded enormously, greatly aided by revolutionary advances in computer hardware 
and software technology. In the light of such major developments, a single chapter 
can never do adequate justice to the many facets of this subject. This chapter therefore 
has the very modest goal of providing a rough road map to the major ideas and devel- 
opments in Bayesian econometrics. Despite this modest objective some parts are still 
quite technical. 

The Bayesian approach, unlike the likelihood or frequentist or classical approach 
presented in previous chapters, requires the specification of a probabilistic model of 
prior beliefs about the unknown parameters, given an initial specification of a model. 
Many researchers are uncomfortable about this step, both philosophically and practi- 
cally. This has traditionally been the basis of the concern that the Bayesian approach 
is subjective rather than objective. It will be shown that in large samples the role of 
the prior may be negligible, that relatively uninformative priors can be specified, and 
that there are methods available for studying the sensitivity of inferences to priors. 
Therefore, the charge of subjectivity may not always be as serious as many claim. 

Bayesian approaches play a potentially large role in applied microeconometrics, 
especially when dealing with complex models that lack analytically tractable likeli- 
hood functions. Chapter 12 introduced simulation-based methods for such situations. 
These methods, particularly simulated likelihood, are potentially problematic as they 
generally require maximization of a function using a sufficiently large number of sim- 
ulation draws that increases at an appropriate rate as the sample size grows. Even with 
today’s powerful computers, analysis of large samples and high-dimensional models 
can require a formidable amount of computation. Bayesian methods, in contrast, do 
not require maximization algorithms. Bayesian procedures are flexible enough to be 
adapted to produce estimates that are excellent (if not perfect) substitutes for maximum
likelihood estimates, which are obtained in many cases more efficiently. Indeed, it is
not necessary that one goes through a philosophical conversion to use these
procedures; they can be adapted for pragmatic reasons.

The foregoing remarks do not mean that Bayesian procedures do not have a deeper 
rationale and justification. They do. Three features in particular deserve to be men- 
tioned. First, Bayesian procedures can yield the entire posterior distribution of the 
parameters of interest, leaving the user to decide which moment or quantile of the 
distribution to report, potentially on the basis of decision-theoretic criteria. One does 
not need separate estimators for means, medians, quantiles, and so forth as the pos- 
terior distribution has them all! Second, Bayesian analysis, being conditional on the 
data, yields exact finite-sample results, obviating the need for finite-sample corrections 
or adjustments. The posterior distribution approaches the normal distribution in large samples
where the influence of the priors vanishes. Third, Bayesian methods provide a natural 
way to select models. 

Section 13.2 introduces the basic concepts and components of Bayesian analysis 
and the key properties of Bayesian estimators. These ideas are illustrated in Section 
13.3 for the relatively tractable linear regression model. More generally, no closed- 
form solution exists for the posterior distribution. Section 13.4 presents Monte Carlo 
integration methods, notably importance sampling, used to obtain numerical estimates 
of posterior moments. Section 13.5 details Markov chain Monte Carlo methods,
notably Gibbs sampling and the Metropolis-Hastings algorithm, used to obtain draws
from the (intractable) posterior distribution. An example of these methods is given in
Section 13.6. The additional topics of data augmentation and Bayesian model selection 
are presented in Sections 13.7 and 13.8. 


13.2. Bayesian Approach 


In the Bayesian approach uncertainty about the value of the parameters $\theta$ is explicitly
modeled by introducing a density $\pi(\theta)$ for the prior distribution, so named because it
is specified without considering the data currently in hand. It expresses subjective
beliefs about the true unknown parameter in the language of probability. Specification of
the prior is studied in detail in Section 13.2.4. As an example, suppose that $\theta$ is an
income elasticity and on the basis of an economic model or previous studies it is felt that
$\theta$ lies between 0.8 and 1.2 with probability 0.95. Then a prior for $\theta$ is $\theta \sim N[1, 0.1^2]$.
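
The elasticity prior above can be checked numerically. The following is a minimal sketch in Python, assuming only the standard library; the variable names are illustrative and not part of the text. It confirms that a normal prior with mean 1 and standard deviation 0.1 places roughly 95% of its mass on the interval from 0.8 to 1.2, which is two standard deviations either side of the mean.

```python
from statistics import NormalDist

# Prior from the text: theta ~ N[1, 0.1^2], i.e. mean 1 and standard deviation 0.1.
prior = NormalDist(mu=1.0, sigma=0.1)

# Prior probability that the income elasticity lies between 0.8 and 1.2.
prob = prior.cdf(1.2) - prior.cdf(0.8)
print(round(prob, 4))
```

The probability is slightly above 0.95 because the interval is exactly two standard deviations wide on each side rather than 1.96.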

The other ingredient of Bayesian inference is the sample joint density or likelihood
$f(y|\theta)$, where in the single-equation case $y$ is an $N \times 1$ vector. Dependence on
regressors is suppressed throughout this section, for notational simplicity. Exogenous
regressors are introduced in Section 13.3, in which case $f(y|\theta)$ becomes $f(y|X, \theta)$
and Bayesian analysis is then conditional on regressors. Note also that in this chapter
$f(\cdot)$ usually denotes the joint density of all observations, rather than the density of the
$i$th observation.

If no data are available then all we have is the prior. After data are observed, the
classical approach is to estimate the unknown parameter $\theta$ using the maximum likelihood
principle. The Bayesian approach instead combines the likelihood of the sample with
the prior, reflecting the view that any prior information should be exploited, even if it
is in the form of a probability distribution. This process can be thought of as a revision
of the prior given the data (likelihood). Indeed, we can derive a distribution of $\theta$ after
combining the likelihood and the prior. The resulting distribution is called a posterior
distribution, and it reflects the investigator’s beliefs about $\theta$ a posteriori, that is, after
observing the data.


13.2.1. Bayes’ Theorem 


The basic result that delivers the posterior distribution is Bayes’ Theorem, also
referred to sometimes as Bayes’ inverse law of probability, that

$$f(\theta|y) = \frac{f(y|\theta)\,\pi(\theta)}{f(y)}, \qquad (13.1)$$

where $f(y)$ denotes the marginal probability distribution of $y$, formally defined as

$$f(y) = \int_{R(\theta)} f(y|\theta)\,\pi(\theta)\,d\theta, \qquad (13.2)$$

where $R(\theta)$ denotes the support of $\pi(\theta)$. This result is obtained by noting that, for
events $A$ and $B$, the conditional probability

$$\Pr[A|B] = \frac{\Pr[A \cap B]}{\Pr[B]} = \frac{\Pr[B|A]\,\Pr[A]}{\Pr[B]},$$

where the second equality follows because $\Pr[B|A] = \Pr[A \cap B]/\Pr[A]$.

Because the denominator $f(y)$ in (13.1) is free of $\theta$, we can more simply write
$p(\theta|y)$ as proportional to the product of the pdf and the prior; thus

$$p(\theta|y) \propto L(y|\theta)\,\pi(\theta). \qquad (13.3)$$

This simplifies derivation and representation of the posterior, by omitting inessential
constants that can be recovered later, as will be illustrated in Section 13.2.2. When a
density function is written without normalizing constants it is referred to as a density
kernel.

In many cases (13.1) or (13.3) do not yield a closed-form expression for the
posterior density. A closed-form expression is not needed, however, and later sections
present recent simulation-based techniques for obtaining good numerical
approximations to the posterior density. These techniques permit Bayesian analysis for almost
any parametric microeconometrics application.

It is common to use a special symbol for the posterior density, so we will replace
$f(\theta|y)$ by $p(\theta|y)$. Also, the original joint density $f(y|\theta)$ is the likelihood function
$L(y|\theta)$. Henceforth we will write the posterior density as

$$p(\theta|y) \propto L(y|\theta)\,\pi(\theta). \qquad (13.4)$$

This representation, the key one for the Bayesian approach, emphasizes an
important difference between the frequentist and Bayesian approaches. In the frequentist
approach, the true value of the parameter is constant but parameter estimates are treated
as random variables. In contrast, in the Bayesian approach the parameter is treated as
if it is random.
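
To make the roles of the kernel in (13.4) and the normalizing constant in (13.2) concrete, here is a minimal Python sketch that evaluates a posterior on a grid. Everything specific in it is an assumption made for illustration: the data (7 successes in 10 Bernoulli trials), the flat prior on (0, 1), and the grid size are hypothetical and do not come from the text. The binomial coefficient is omitted because it is free of the parameter and cancels on normalization.

```python
# Hypothetical data: 7 successes in 10 Bernoulli trials (illustrative only).
successes, trials = 7, 10

# Midpoints of 200 equal-width cells covering the parameter space (0, 1).
n_cells = 200
width = 1.0 / n_cells
grid = [(i + 0.5) * width for i in range(n_cells)]

def likelihood(theta):
    """Bernoulli likelihood kernel; constants free of theta are dropped."""
    return theta ** successes * (1.0 - theta) ** (trials - successes)

def prior(theta):
    """Flat prior on (0, 1), an assumption made only for this sketch."""
    return 1.0

# Density kernel L(y|theta) * pi(theta), as in (13.4).
kernel = [likelihood(t) * prior(t) for t in grid]

# Approximate the marginal f(y) in (13.2) by a Riemann sum, then normalize as in (13.1).
marginal = sum(k * width for k in kernel)
posterior = [k / marginal for k in kernel]

# Posterior mean computed from the normalized grid density.
post_mean = sum(t * p * width for t, p in zip(grid, posterior))
print(round(post_mean, 4))
```

With a flat prior this posterior is a Beta density with parameters 8 and 4, so the grid-based mean should be close to 8/12. Grid evaluation of this kind is practical only for low-dimensional parameters, which is one motivation for the simulation methods presented later in the chapter.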


13.2.2. Bayes’ Theorem Example 


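Suppose $y \sim N[\theta, \sigma^2]$, where $\sigma^2$ is known but the scalar parameter $\theta$ is unknown.
Given a random sample $(y_1, \ldots, y_N)$, the joint density of $y$ is

$$L(y|\theta) = \prod_{i=1}^{N} (2\pi\sigma^2)^{-1/2} \exp\{-(y_i - \theta)^2/2\sigma^2\}$$
$$= (2\pi\sigma^2)^{-N/2} \exp\left\{-\sum_{i=1}^{N} (y_i - \theta)^2/2\sigma^2\right\}$$
$$\propto \exp\left\{-\frac{N}{2\sigma^2}(\bar{y} - \theta)^2\right\},$$

where $\bar{y} = N^{-1}\sum_i y_i$ and we use $\sum_i (y_i - \theta)^2 = \sum_i (y_i - \bar{y} + \bar{y} - \theta)^2 =
\sum_i (y_i - \bar{y})^2 + N(\bar{y} - \theta)^2$. Multiplicative terms not involving $\theta$, which are
absorbed in the constant of proportionality, are dropped. The frequentist approach
maximizes the log-likelihood with respect to $\theta$, leading to the MLE $\hat{\theta} = \bar{y}$.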

The Bayesian approach additionally specifies a prior for $\theta$. An analytically
convenient choice is the normal prior, with $\theta \sim N[\mu, \tau^2]$, where we suppose that values
of the prior mean $\mu$ and prior variance $\tau^2$ are specified. A large value of $\tau^2$ indicates
greater prior uncertainty than a small value. Then the prior density is

$$\pi(\theta) = (2\pi\tau^2)^{-1/2} \exp\{-(\theta - \mu)^2/2\tau^2\}$$
$$\propto \exp\{-(\theta - \mu)^2/2\tau^2\},$$

where $(2\pi\tau^2)^{-1/2}$, which is free of $\theta$, is absorbed into the factor of proportionality.
Using (13.4), we obtain the posterior density

$$p(\theta|y) = \frac{L(y|\theta)\,\pi(\theta)}{\int L(y|\theta)\,\pi(\theta)\,d\theta}, \qquad -\infty < \theta < \infty. \qquad (13.5)$$

The denominator ensures that the posterior is proper (i.e., it integrates to 1). For some
purposes the denominator can be ignored, in which case we work with $p(\theta|y) \propto
L(y|\theta)\pi(\theta)$. The numerator can be expanded as follows:


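$$L(y|\theta)\,\pi(\theta) = (2\pi)^{-(N+1)/2}(\sigma^2)^{-N/2}(\tau^2)^{-1/2}
\exp\left\{-\sum_{i=1}^{N} \frac{(y_i - \theta)^2}{2\sigma^2} - \frac{(\theta - \mu)^2}{2\tau^2}\right\}.$$

Because

$$\sum_{i=1}^{N} (y_i - \theta)^2 = \sum_{i=1}^{N} (y_i - \bar{y})^2 + N(\bar{y} - \theta)^2,$$

and noting that the constant of integration in (13.5) and other multiplicative constants
independent of $\theta$ can be absorbed into the proportionality constant, we have

$$p(\theta|y) \propto \exp\left[-\frac{N}{2\sigma^2}(\bar{y} - \theta)^2 - \frac{1}{2}\frac{(\theta - \mu)^2}{\tau^2}\right] \qquad (13.6)$$
$$\propto \exp\left[-\frac{1}{2}\left(\frac{(\theta - \mu)^2}{\tau^2} + \frac{(\theta - \bar{y})^2}{N^{-1}\sigma^2}\right)\right]$$
$$\propto \exp\left[-\frac{1}{2}\frac{(\theta - \mu_1)^2}{\tau_1^2}\right]. \qquad (13.7)$$

The last line is the kernel of the $N[\mu_1, \tau_1^2]$ distribution, where

$$\mu_1 = \tau_1^2\,(N\bar{y}/\sigma^2 + \mu/\tau^2), \qquad (13.8)$$
$$\tau_1^2 = (N/\sigma^2 + 1/\tau^2)^{-1}.$$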


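The final line in (13.7) is obtained by completing the square, using the result that for
arbitrary scalars $z$, $a_1$, $a_2$, $c_1$, and $c_2$, we have

$$c_1(z - a_1)^2 + c_2(z - a_2)^2 = (c_1 + c_2)\left(z - \frac{c_1 a_1 + c_2 a_2}{c_1 + c_2}\right)^2
+ \frac{c_1 c_2}{c_1 + c_2}(a_1 - a_2)^2,$$

where $z = \theta$, $a_1 = \mu$, $a_2 = \bar{y}$, $c_1 = 1/\tau^2$, and $c_2 = 1/(N^{-1}\sigma^2) = N/\sigma^2$. With these
values $c_1 + c_2 = 1/\tau_1^2$ and $(c_1 a_1 + c_2 a_2)/(c_1 + c_2) = \mu_1$. The terms free
of $\theta$ are dropped.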
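
The completing-the-square result can be verified numerically. The sketch below, in Python with hypothetical function names, evaluates both sides of the identity at several values of $z$; the particular numbers are illustrative assumptions (they happen to match the concrete example given later in this section).

```python
def two_squares(z, a1, a2, c1, c2):
    """Left-hand side: c1 (z - a1)^2 + c2 (z - a2)^2."""
    return c1 * (z - a1) ** 2 + c2 * (z - a2) ** 2

def completed_square(z, a1, a2, c1, c2):
    """Right-hand side after completing the square in z."""
    center = (c1 * a1 + c2 * a2) / (c1 + c2)
    remainder = c1 * c2 / (c1 + c2) * (a1 - a2) ** 2
    return (c1 + c2) * (z - center) ** 2 + remainder

# Illustrative values: a1 = prior mean, a2 = sample mean,
# c1 = 1/tau^2 (prior precision), c2 = N/sigma^2 (sample precision).
a1, a2, c1, c2 = 5.0, 10.0, 1.0 / 3.0, 50.0 / 100.0

# Absolute difference between the two sides at several evaluation points.
gaps = [abs(two_squares(z, a1, a2, c1, c2) - completed_square(z, a1, a2, c1, c2))
        for z in (-3.0, 0.0, 7.5, 8.0, 20.0)]
print(max(gaps))
```

The differences are zero up to floating-point rounding. The second term on the right-hand side does not involve $z$, which is why it can be absorbed into the constant of proportionality.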
In summary, we have the following: 


Data: $y|\theta \sim N[\theta, \sigma^2]$, $\sigma^2$ known.
Prior: $\theta \sim N[\mu, \tau^2]$, $\mu$, $\tau^2$ specified.
Posterior: $\theta|y \sim N[\mu_1, \tau_1^2]$, $\mu_1$, $\tau_1^2$ given in (13.8).


The posterior mean $\mu_1$ is a weighted sum of the prior mean $\mu$ and the sample mean $\bar{y}$
with weights that reflect the precision of the likelihood via $\sigma^2/N$ and the prior via $\tau^2$.
Bayesian practice is to summarize variability using the precision parameter, defined
as the reciprocal of the variance. Here the posterior precision $\tau_1^{-2}$ is the sum of the
sample precision of $\bar{y}$, $N/\sigma^2$, and the prior precision $1/\tau^2$, so precision is increased
by pooling the sample and prior information.

If the prior information is imprecise, so that $1/\tau^2$ is small, then the weight assigned
to the prior mean is also small relative to the sample information and the prior plays
a minor role in generating the posterior. Similarly, the sample information also
dominates as the sample size gets large, since then $N/\sigma^2$ gets large relative to $1/\tau^2$. The
posterior distribution then tends to the familiar asymptotic normal distribution, except
that the Bayesian result is that $\theta \sim N[\bar{y}, \sigma^2/N]$ rather than $\bar{y} \sim N[\theta, \sigma^2/N]$.

[Figure 13.1 plot omitted: “Bayes: Likelihood, Prior and Posterior”; density (vertical axis)
against evaluation point (horizontal axis) for the likelihood N[10, 2], the prior N[5, 3],
and the posterior N[8, 1.2].]

Figure 13.1: Bayesian analysis for mean parameter of normal density: plot of normal
likelihood (right), normal prior density (left), and resulting posterior density (center).


As a concrete example, suppose $\sigma^2 = 100$, the prior sets $\mu = 5$ and $\tau^2 = 3$, and a
sample of size $N = 50$ has sample mean $\bar{y} = 10$. Then the likelihood is $N[10, 2]$, the
prior is $N[5, 3]$, and from (13.7) and (13.8) the posterior is $N[8, 1.2]$. These densities
are plotted in Figure 13.1. The posterior mean lies between the prior mean and the
sample mean, whereas the posterior has variance that is smaller than the variance of
both the prior and the likelihood.
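
The arithmetic behind this example follows directly from (13.8). A minimal Python sketch is given below; the function name `normal_posterior` is hypothetical, and the loop over sample sizes holds the sample mean fixed at 10 purely for illustration.

```python
def normal_posterior(ybar, n, sigma2, mu, tau2):
    """Posterior mean and variance from (13.8): normal data, known variance, normal prior."""
    precision = n / sigma2 + 1.0 / tau2               # posterior precision
    tau1_sq = 1.0 / precision                         # posterior variance
    mu1 = tau1_sq * (n * ybar / sigma2 + mu / tau2)   # posterior mean
    return mu1, tau1_sq

# Values from the concrete example in the text.
mu1, tau1_sq = normal_posterior(ybar=10.0, n=50, sigma2=100.0, mu=5.0, tau2=3.0)
print(round(mu1, 4), round(tau1_sq, 4))

# The sample information dominates as N grows (sample mean held at 10 for illustration).
for n in (5, 50, 5000):
    mean_n, var_n = normal_posterior(10.0, n, 100.0, 5.0, 3.0)
    print(n, round(mean_n, 3), round(var_n, 4))
```

The first call reproduces the posterior $N[8, 1.2]$. In the loop the posterior mean moves from near the prior mean toward the sample mean, and the posterior variance shrinks, as the sample size increases.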


13.2.3. Bayesian and Non-Bayesian Approaches Compared 


It is useful to draw parallels and contrasts between the frequentist and Bayesian 
approaches. 

In a parametric frequentist formulation the likelihood function is the main ba- 
sis of statistical inference. Under suitable regularity conditions the MLE is consis- 
tent and asymptotically normal. Sampling theory of estimators provides a basis for 
probability statements about the estimated magnitudes, or functions thereof, or con- 
ditional prediction. Prior information on parameters is incorporated by restricted ML 
estimation. 

In a Bayesian analysis, summarized in Table 13.1, the data-generating process and 
the data are combined with a prior distribution on the parameters. Specification of this 
prior distribution is discussed in detail in Section 13.2.4. The prior embodies prob- 
abilistically specified information before the current data are analyzed and may be 
based on “received information.” The prior information and the data are combined 
using Bayes’ Theorem. 

The outcome of this exercise is the posterior distribution of the parameters $\theta$, which
we may think of as the translated likelihood function. Alternatively, given the data, the
posterior distribution reflects our “revised prior.” If the sample is small, and perhaps
relatively uninformative, the posterior may look like the prior, but if the sample is
large, the posterior distribution will reflect the features of the data.


Table 13.1. Bayesian Analysis: Essential Components

Component                               Formula
Sampling model                          $(y_1, \ldots, y_N)$ iid from $f(y|\theta)$
Joint density/likelihood                $f(y|\theta)$, $L(y|\theta)$, $\theta \in \Theta$
Prior distribution                      $\pi(\theta)$, $\theta \in \Theta$
Posterior density                       $p(\theta|y) = f(y|\theta)\pi(\theta) / \int f(y|\theta)\pi(\theta)\,d\theta$
                                        $\propto f(y|\theta)\pi(\theta)$
                                        $\propto L(y|\theta)\pi(\theta)$
Posterior pdf -> posterior inference    parameter estimation
                                        probability statements
                                        prediction
                                        model comparison


13.2.4. Specification of the Prior 


Bayesian analysis requires specification of the dgp $f(y|\theta)$ and of the prior $\pi(\theta)$. The
dgp is usually specified to be the same as that used in a fully parametric likelihood-
based analysis. For binary outcomes a logit or probit model might be specified, for
count data the Poisson or negative binomial model would be specified, and so on.

The principal challenge introduced by Bayesian analysis, compared to classical
analysis, is the need to additionally specify a prior distribution. Results can vary with
the choice of prior, as different priors lead to different posterior distributions unless
the sample is large enough that the sample information dominates.

One approach is to choose a prior such that it has little impact on the posterior, 
so that results essentially are based on the sampled data. An alternative approach, 
warranted when strong prior information is available, is to specify a prior that reflects 
this information. Both approaches, especially the latter, were historically constrained 
by issues of tractability of the resulting posterior, but this has now become much less of 
a consideration given recent computational advances. A popular intermediate approach 
is to use hierarchical priors, with uncertainty about parameters expressed in terms of 
probability functions that themselves involve other parameters about which we are also 
uncertain. 


Noninformative Priors 


A noninformative prior is one that has little impact on the resulting posterior 
distribution. 

The obvious way to try to obtain a noninformative prior is to use a uniform prior
with $\pi(\theta) = c$ for all $\theta$, where $c > 0$ is a constant, since this places equal weight on
all possible values of $\theta$.

One disadvantage of the uniform prior is that if it is used in settings where the
parameters $\theta$ are unbounded then the prior is an improper density because then
necessarily $\int \pi(\theta)\,d\theta = \infty$. The resulting posterior distribution may then also be improper,
though in several leading examples the posterior is nonetheless proper.

Another disadvantage of the uniform prior is that it is not invariant to
reparameterization. For example, for a scalar parameter $\theta > 0$ an alternative obvious
parameterization of the density of $y$ is in terms of the parameter $\gamma = \ln\theta$, as then $-\infty < \gamma < \infty$.
If $\theta$ has a uniform prior, $\pi(\theta) = c$, then the corresponding prior $\pi^*(\gamma)$ for $\gamma$ is not the
uniform since $\pi^*(\gamma) = \pi(\theta)\,|d\theta/d\gamma| = c\,e^{\gamma}$. Although seemingly uninformative for
one parameterization, the prior is informative in another parameterization.
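
The lack of invariance can be seen by simulation. The Python sketch below makes one simplifying assumption that is not in the text: the uniform prior is restricted to the interval (0, 1] so that it is proper and can be sampled, in which case the implied prior for $\gamma = \ln\theta$ is $e^{\gamma}$ on $\gamma \le 0$. The sample size and seed are arbitrary.

```python
import math
import random

random.seed(0)

# Uniform prior for theta on (0, 1], chosen only to keep the illustration proper.
draws_theta = [1.0 - random.random() for _ in range(200_000)]  # values in (0, 1]
draws_gamma = [math.log(t) for t in draws_theta]               # gamma = ln(theta)

def share(lo, hi):
    """Fraction of gamma draws falling in [lo, hi)."""
    return sum(lo <= g < hi for g in draws_gamma) / len(draws_gamma)

# Two intervals of equal length receive very different prior mass.
share_near = share(-1.0, 0.0)   # theory: 1 - exp(-1)
share_far = share(-2.0, -1.0)   # theory: exp(-1) - exp(-2)
print(round(share_near, 3), round(share_far, 3))
```

Under a prior that was uniform in $\gamma$ the two shares would be equal; here the interval nearer zero receives well over twice the mass of the adjacent interval, in line with the density $c\,e^{\gamma}$.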

The uniform prior can be emulated by specifying a proper prior that has very large
variances. For example, suppose the scalar $\theta$ has $N[\mu, \tau^2]$ prior, where $\tau^2$ is very
large. Then for values of $\theta$ likely to be supported by the data the prior $\pi(\theta) \simeq
1/\sqrt{2\pi\tau^2}$, a constant, because $\exp[-(\theta - \mu)^2/2\tau^2] \simeq 1$. It is important to note that this
obvious approach, called a vague or diffuse or flat prior, has the same weakness as
the uniform prior. It is not invariant to reparameterization.

Instead, a widely used noninformative prior is Jeffreys’ prior,

$$\pi(\theta) \propto |\mathcal{I}(\theta)|^{1/2}, \qquad (13.9)$$

where for a vector $\theta$, $|\mathcal{I}(\theta)|$ is the determinant of the information matrix $\mathcal{I}(\theta) =
-E[\partial^2 \mathcal{L}/\partial\theta\,\partial\theta']$ with $\mathcal{L} = \ln L(y|\theta)$. Jeffreys’ prior, named after the pioneering
Bayesian Harold Jeffreys, has the property of invariance to reparameterization or
transformation of model parameters, so that the same prior information is being given
regardless of the particular parameterization chosen.

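To verify Jeffreys’ rule, for simplicity consider the scalar parameter case. Given
transformation $\gamma = h(\theta)$, $\partial\mathcal{L}/\partial\gamma = \partial\mathcal{L}/\partial\theta \times \partial\theta/\partial\gamma$ and

$$\frac{\partial^2 \mathcal{L}}{\partial\gamma^2} = \frac{\partial^2 \mathcal{L}}{\partial\theta^2}\left(\frac{\partial\theta}{\partial\gamma}\right)^2
+ \frac{\partial \mathcal{L}}{\partial\theta}\,\frac{\partial^2\theta}{\partial\gamma^2}.$$

Taking expectations with respect to the sample density and noting that $E[\partial\mathcal{L}/\partial\theta] = 0$
by the property of likelihood scores yields

$$\mathcal{I}(\gamma) = \mathcal{I}(\theta)\left(\frac{\partial\theta}{\partial\gamma}\right)^2.$$

It follows that

$$|\mathcal{I}(\gamma)|^{1/2} = |\mathcal{I}(\theta)|^{1/2}\left|\frac{\partial\theta}{\partial\gamma}\right|.$$

In general the prior $\pi(\theta)$ for $\theta$ implies the prior for $\gamma$ is $\pi^*(\gamma) = \pi(\theta) \times |d\theta/d\gamma|$.
Specializing to prior (13.9), we have $\pi^*(\gamma) \propto |\mathcal{I}(\theta)|^{1/2} \times |d\theta/d\gamma|$, but this is $|\mathcal{I}(\gamma)|^{1/2}$
as desired.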

As an example, suppose $y \sim N[\mu, \sigma^2]$, and consider three cases. First, if $\mu$ is the
unknown parameter and $\sigma^2$ is known, then the information measure for $\mu$ is $\mathcal{I}(\mu) =
N/\sigma^2$, and Jeffreys’ prior $|\mathcal{I}(\mu)|^{1/2} \propto c$, a constant since here $\sigma^2$ is known. Note that
this prior is an improper prior. Second, if $\sigma^2$ is unknown and $\mu$ is known, then the
information measure for $\sigma^2$ is $\mathcal{I}(\sigma^2) = N/(2\sigma^4)$, and Jeffreys’ prior $|\mathcal{I}(\sigma^2)|^{1/2} \propto
\sigma^{-2}$. Third, if both $\mu$ and $\sigma^2$ are unknown then the information matrix has determinant
$|\mathcal{I}(\mu, \sigma^2)| = (N/\sigma^2)(N/2\sigma^4) = N^2/2\sigma^6$. Therefore, Jeffreys’ rule implies that the joint prior
$\pi(\mu, \sigma^2) \propto \sigma^{-3}$. Note that this is different from what we get if we apply Jeffreys’
rule to the separate priors for $\mu$ and $\sigma^2$, as $\pi(\mu) \propto c$ and $\pi(\sigma^2) \propto \sigma^{-2}$ yields
$\pi(\mu)\pi(\sigma^2) \propto \sigma^{-2}$.
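
The information measure for $\sigma^2$ in the second case can be checked by Monte Carlo. The Python sketch below relies on the standard result that the information equals the variance of the score; the parameter values, seed, and number of draws are illustrative assumptions, and the calculation is for a single observation ($N = 1$).

```python
import random

random.seed(1)

mu, sigma2 = 0.0, 2.0          # illustrative values (assumptions)
sigma = sigma2 ** 0.5

def score_sigma2(y):
    """Derivative of the N[mu, sigma2] log density with respect to sigma2."""
    return -0.5 / sigma2 + (y - mu) ** 2 / (2.0 * sigma2 ** 2)

draws = [random.gauss(mu, sigma) for _ in range(400_000)]
scores = [score_sigma2(y) for y in draws]

# Information per observation = expected squared score (the score has mean zero).
info_mc = sum(s * s for s in scores) / len(scores)
info_exact = 1.0 / (2.0 * sigma2 ** 2)   # N/(2 sigma^4) with N = 1
print(round(info_mc, 4), info_exact)
```

The simulated value should be close to $1/(2\sigma^4)$, whose square root is proportional to $\sigma^{-2}$, the Jeffreys’ prior stated for this case.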

Jeffreys’ rule can serve as a method of generating a prior when there are no obvious 
candidate priors available. However, the literature does not seem to have resolved the 
issue of whether the rule produces a noninformative prior and if so in what sense. 
Further, as is clear from the preceding example Jeffreys’ prior can be improper, which 
may lead to an improper posterior. 


Conjugate Priors 


When a proper prior is specified, either as an informative prior or as a diffuse prior, it 
is convenient to choose a functional form for the prior that, given the specified sample 
density for the data, leads to a “nice” analytically tractable expression for the posterior, 
such as (13.7). 

Such tractable results most often arise if the sample and prior densities form a nat- 
ural conjugate pair, defined as having the property that sample density and prior and 
posterior distributions all lie in the same class of densities. Then the prior is called 
a natural conjugate prior. Section 13.2.2 gave an example, where for normally distributed
data a normal prior for the mean leads to a posterior that is also normal.

The exponential family is essentially the only class of densities to have natural
conjugate priors. A one-parameter member of the exponential family has a density
that for a single observation can be expressed as

$$f(y|\theta) = \exp\{a(\theta) + b(y) + c(\theta)u(y)\} \qquad (13.10)$$
$$\propto \exp\{a(\theta) + c(\theta)u(y)\},$$

where different functions $a(\cdot)$, $c(\cdot)$, and $u(\cdot)$ lead to different densities in the family, and
$b(\cdot)$ is a normalizing constant. For example, setting $c(\theta) = \mu/\sigma^2$, $a(\theta) = -\mu^2/2\sigma^2$,
and $u(y) = y$ yields the kernel of the $N[\mu, \sigma^2]$ distribution (for $\sigma^2$ known). Note
that setting $u(y) = y$ yields the linear exponential family, presented in some detail in
Section 5.7.3. More generally, if $\theta$ is a vector then $c(\theta)u(y)$ is replaced by $c(\theta)'u(y)$,
where usually $u(\cdot)$ has the same dimension as $\theta$.
For a random sample of size $N$ the exponential family leads to sample density

$$L(y|\theta) \propto \exp\{N a(\theta) + c(\theta)t(y)\}, \qquad (13.11)$$

where $t(y) = \sum_i u(y_i)$. Consider the following prior on $\theta$:

$$\pi(\theta|\beta, \alpha) \propto \exp\{\beta a(\theta) + \alpha c(\theta)\}, \qquad (13.12)$$

where $\alpha$ and $\beta$ are specified parameters of the prior and the functions $a(\cdot)$ and $c(\cdot)$ are
the same as those in (13.10). This density is an exponential family density for $\theta$ once
$\alpha$ is viewed as fixed. Applying Bayes’ Theorem and simplifying, we get

$$p(\theta|y) \propto L(y|\theta)\,\pi(\theta|\beta, \alpha) \qquad (13.13)$$
$$\propto \exp\{(\beta + N)a(\theta) + (\alpha + t(y))c(\theta)\},$$

which is readily verified to have the same kernel as the original prior in (13.12).
Comparison of the posterior with the sample density reveals that the prior is treated as
providing an additional $\beta$ observations $y_p$, say, with $t(y_p) = \alpha$.


Table 13.2. Conjugate Families: Leading Examples

Distribution    Sample Density                         Conjugate Prior Density
Normal          $N[\theta, \sigma^2]$                  $\theta \sim N[\mu, \tau^2]$
Normal          $N[\mu, 1/\theta^2]$                   $\theta \sim G[\alpha, \beta]$
Binomial        $B[N, \theta]$                         $\theta \sim$ Beta$[\alpha, \beta]$
Poisson         $P[\theta]$                            $\theta \sim G[\alpha, \beta]$
Gamma           $G[\nu, \theta]$                       $\theta \sim G[\alpha, \beta]$
Multinomial     $MN[\theta_1, \ldots, \theta_k]$       $\theta_1, \ldots, \theta_k \sim$ Dirichlet$[\alpha_1, \ldots, \alpha_k]$

Table 13.2 presents some standard conjugate families, where the relevant densi- 
ties are provided in Appendix B. The gamma includes exponential and chi-square as 
special cases. Negative binomial, uniform, and Pareto likelihoods also have conjugate 
prior densities. 

An attraction of a conjugate prior is the resulting computational and analytical sim- 
plicity. Nevertheless, using a conjugate prior is a restriction and the justification for 
imposing it is less compelling now than it was in the past when computational re- 
sources available to a typical researcher were rather limited. 

Another advantage of having a posterior that is in the same class as the prior is that 
the posterior can easily replace the prior as a new (data-based) prior for a later analysis. 
If a prior is to be interpreted as “received information,” then one may take the posterior 
from one investigation as a prior for the next. 
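
The use of one posterior as the prior for the next analysis can be sketched with the binomial and beta pair from Table 13.2. The Python code below assumes the standard parameterization in which a Beta[$\alpha$, $\beta$] prior combined with $k$ successes in $N$ trials gives a Beta[$\alpha + k$, $\beta + N - k$] posterior; the prior parameters and the two data sets are hypothetical.

```python
def beta_update(alpha, beta, successes, trials):
    """Beta[alpha, beta] prior plus binomial data -> Beta posterior parameters."""
    return alpha + successes, beta + (trials - successes)

prior = (2.0, 2.0)                      # hypothetical prior parameters

# Two hypothetical studies: (successes, trials).
study1, study2 = (7, 10), (12, 20)

# Sequential analysis: the posterior from study 1 is the prior for study 2.
post1 = beta_update(*prior, *study1)
post_sequential = beta_update(*post1, *study2)

# Pooled analysis: both studies combined in a single update.
post_pooled = beta_update(*prior, study1[0] + study2[0], study1[1] + study2[1])

posterior_mean = post_sequential[0] / sum(post_sequential)
print(post_sequential, post_pooled, round(posterior_mean, 4))
```

The sequential and pooled analyses give the same posterior parameters, which is what makes the interpretation of a conjugate prior as “received information” from earlier data straightforward.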


Hierarchical Priors 


Hierarchical priors are those that arise when the parameters in a prior are themselves 
modeled as having a distribution. The parameters that appear in such a “prior on a 
prior” are called hyperparameters. 

The data have joint density $L(y|\theta)$, as in Section 13.2.1, but now the prior on $\theta$
depends on parameters $\tau$, say, that are random rather than fixed. Thus the prior on
$\theta$ is $\pi(\theta|\tau)$, where the parameters $\tau$ in turn have a prior $\pi(\tau)$. The joint prior is
$\pi(\theta, \tau) = \pi(\theta|\tau)\pi(\tau)$, and Bayes’ rule yields the joint posterior

$$p(\theta, \tau|y) \propto L(y|\theta)\,\pi(\theta|\tau)\,\pi(\tau).$$

Interest will usually lie in the marginal posterior for $\theta$, which is obtained by
integrating the joint posterior with respect to $\tau$. The specified parameters of the prior
$\pi(\tau)$ are called hyperparameters. Alternatively, these parameters in turn can be
given a prior, in which case another hierarchical level is introduced leading to joint
prior $\pi(\theta|\tau)\pi(\tau|\lambda)\pi(\lambda)$, and so on. Recent advances in computational methods for
Bayesian analysis, particularly the Gibbs sampler, are well suited to hierarchical priors
because of their recursive structure.

Hierarchical priors can be viewed as a Bayesian analogue of random coefficient
models in a classical setting. For example, for iid count data we might suppose that
$y_i \sim P[\theta_i]$, where the Poisson parameter is now random. A convenient distribution
for $\theta_i$ is the conjugate gamma distribution, so $\theta_i \sim G[\alpha, \beta]$. The classical approach
estimates $\alpha$ and $\beta$ by maximum likelihood. A nonhierarchical Bayesian model
specifies values for $\alpha$ and $\beta$ and obtains the posterior for $\theta_i$. A hierarchical Bayesian model
specifies priors for $\alpha$ and $\beta$, such as the gamma that is conjugate, and first obtains the
joint posterior for $\theta_i$, $\alpha$, and $\beta$ before finding the marginal posterior for $\theta_i$.

Hierarchical priors arise naturally in the context of hierarchical models, also 
known as multilevel models. Such models are widely applied in classical settings 
using special purpose software (Bryk and Raudenbush, 1992, 2002). An early
contribution by Lindley and Smith (1972) analyzed hierarchical regression models in a
Bayesian setting. Hierarchical modeling has a natural appeal when the data to be an- 
alyzed naturally fall into strata, groups, or layers, and further one may expect to see 
groupwise parameter variation in the relationship of interest. For example, observa- 
tions on test scores could come from students in specific grades and schools. Modeling 
of test scores could involve individual characteristics that by definition vary across in- 
dividuals, class characteristics that vary across grades, and school characteristics that 
only vary across schools. Because such data will involve clustering of observations, 
this topic is also discussed in Chapter 24. Such models also have a close relationship 
with random effects formulation for panel data. 

As an example, suppose that data naturally fall into $J$ groups, and that the
population mean of $y$ differs across the groups. For individual $i$ in group $j$ suppose
$y_{ij} \sim N[\theta_j, \sigma^2]$, where for simplicity we assume $\sigma^2$ is known. Then the sample mean
in the $j$th group $\bar{y}_j \sim N[\theta_j, \sigma^2/N_j]$, where $N_j$ denotes the number of individuals in
the group and independence is assumed. A hierarchical model specifies the means $\theta_j$
to have prior $\theta_j \sim N[\mu, \tau^2]$, for example, where additional priors are specified for the
parameters $\mu$ and $\tau^2$ of the higher level prior.
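
One step of this hierarchical model can be sketched by conditioning on the higher level parameters. The Python code below applies (13.8) group by group with $\mu$ and $\tau^2$ held fixed, which is an assumption made for illustration only: a full hierarchical analysis would also place priors on $\mu$ and $\tau^2$ and integrate them out. The function name and all numerical values are hypothetical.

```python
def group_posterior(ybar_j, n_j, sigma2, mu, tau2):
    """Posterior mean and variance of theta_j given mu and tau2, using (13.8)."""
    precision = n_j / sigma2 + 1.0 / tau2
    var = 1.0 / precision
    mean = var * (n_j * ybar_j / sigma2 + mu / tau2)
    return mean, var

sigma2 = 100.0            # known within-group variance (illustrative)
mu, tau2 = 50.0, 25.0     # higher level parameters, held fixed here (assumption)

# Hypothetical group sample means and group sizes.
groups = {"small group": (70.0, 4), "large group": (70.0, 400)}

shrunk = {name: group_posterior(ybar, n, sigma2, mu, tau2)
          for name, (ybar, n) in groups.items()}
for name, (mean, var) in shrunk.items():
    print(name, round(mean, 2), round(var, 3))
```

Both groups have the same sample mean, but the posterior mean for the small group is pulled much further toward $\mu$ than that for the large group, because its sample precision $N_j/\sigma^2$ is low relative to the prior precision $1/\tau^2$.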


Sensitivity Analysis 


In a frequentist analysis one may entertain a variety of exact prior restrictions in for- 
mulating a model for estimation. For example, a model may be estimated under one 
or more sets of restrictions, and the results can be compared to form an idea of the 
sensitivity of the estimates to prior assumptions. 

The same logic and approach applies in Bayesian analysis. One need not take the
prior to be literally true, and one can perform a sensitivity analysis that studies how the
posterior changes with different choices of prior. Similarly, one can vary assumptions
about the dgp and see how posterior beliefs change in response.


13.2.5. Densities and Measures Related to the Posterior 


Bayesian analysis is based on the posterior distribution. For convenience Bayesian
regression results usually report only summary measures, such as posterior moments,
quantiles, or marginal distributions of components of $\theta$. However, the posterior
distribution is also used for prediction and probability statements, detailed in this section,
and for model comparison, presented in Section 13.8.

Several quantities play an important role in a Bayesian analysis. 


Marginal Posterior 


In general $\theta$ is multidimensional, denoted by $\theta' = (\theta_1, \ldots, \theta_q)$, and interest may lie
in the posterior distribution of individual components of $\theta$. The marginal posterior
density of the $k$th parameter, $\theta_k$, is obtained by integrating out of the joint posterior
all the remaining $(q - 1)$ elements of $\theta$. Formally, this is denoted as $p(\theta_k|y)$ and is
obtained by calculating the $(q - 1)$-fold integral

$$p(\theta_k|y) = \int p(\theta_1, \ldots, \theta_q|y)\,d\theta_1 \cdots d\theta_{k-1}\,d\theta_{k+1} \cdots d\theta_q \qquad (13.14)$$
$$= \int p(\theta|y)\,d\theta_{-k},$$

where the more compact notation in the second line contains $\theta_{-k}$, which means all
elements of $\theta$ other than $\theta_k$. The marginal posterior density is usually asymmetric and
need not be unimodal, whereas the asymptotic normal distribution for classical
estimators is symmetric and unimodal. It can be useful to graph the posterior, especially
if it departs considerably from a symmetric unimodal distribution.


Posterior Moments 


Classical regression output reports the parameter estimate and standard error. For 
Bayesian regression one can similarly report the mean or median and the standard 
deviation of the marginal posterior density of each parameter. 


Point Estimation 


In classical analysis there is an unknown true parameter value $\theta_0$ such that the dgp
is $f(y|\theta_0)$, and we seek a point estimate that is a good estimate of $\theta_0$. In Bayesian
analysis, in contrast, interest lies in the entire distribution of $\theta$, which is determined by
both $\theta_0$ and prior beliefs about $\theta_0$.

Point estimation is therefore emphasized much less in Bayesian analysis. For conve- 
nience the posterior mean or the posterior median are nonetheless commonly reported 
as point estimates. By specifying a loss function an optimal point estimate of a param- 
eter can be obtained; see Section 13.2.7. 


Posterior Intervals


Once the posterior distribution has been obtained, it can be used to make probability
statements analogous to those in the frequentist analysis. In particular, we can consider
Bayesian confidence intervals and regions.

For the $k$th parameter, a $100(1 - \alpha)\%$ posterior density interval $R(\theta_k)$ is any
interval that $\theta_k$ falls into with posterior probability $1 - \alpha$, or formally

$$1 - \alpha = \Pr[\theta_k \in R(\theta_k)|y] = \int_{R(\theta_k)} p(\theta_k|y)\, d\theta_k. \tag{13.15}$$

There are many regions that correspond to this probability. The simplest posterior
interval is one between the $\alpha/2$ and $(1 - \alpha/2)$ quantiles, such as between the 2.5 and
97.5 percentiles. More complicated is a highest posterior density (HPD) interval
that satisfies (13.15) and additionally the condition that no point in $R(\theta_k)$ has a smaller
probability density than any point outside the region. This interval need not be
contiguous if the posterior is multimodal, and it differs from the simpler interval unless
the posterior is symmetric and unimodal.
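
The difference between the two kinds of interval can be illustrated numerically. The following Python sketch is only an illustration under stated assumptions: the posterior is taken to be a right-skewed Gamma(2, 1) distribution chosen for the example, posterior draws are assumed to be available, and the function names `equal_tailed_interval` and `hpd_interval` are hypothetical. The HPD interval is approximated by the shortest interval that contains 95% of the draws, which is appropriate only when the posterior is unimodal.

```python
import random


def equal_tailed_interval(draws, alpha=0.05):
    """Interval between the alpha/2 and 1 - alpha/2 empirical quantiles."""
    x = sorted(draws)
    n = len(x)
    lo = x[int((alpha / 2) * (n - 1))]
    hi = x[int((1 - alpha / 2) * (n - 1))]
    return lo, hi


def hpd_interval(draws, alpha=0.05):
    """Shortest interval covering a fraction 1 - alpha of the draws.

    For a unimodal posterior this approximates the HPD interval.
    """
    x = sorted(draws)
    n = len(x)
    m = int(round((1 - alpha) * n))  # number of draws the interval must cover
    # Slide a window of m consecutive order statistics and keep the narrowest.
    width, start = min((x[i + m - 1] - x[i], i) for i in range(n - m + 1))
    return x[start], x[start + m - 1]


rng = random.Random(0)
# Assumed posterior for illustration: Gamma(shape=2, scale=1), right-skewed.
draws = [rng.gammavariate(2.0, 1.0) for _ in range(20000)]
et = equal_tailed_interval(draws)
hpd = hpd_interval(draws)
print("equal-tailed:", et, "width", et[1] - et[0])
print("HPD (approx):", hpd, "width", hpd[1] - hpd[0])
```

For a skewed posterior such as this one the HPD interval is shorter and is shifted toward the mode, whereas for a symmetric unimodal posterior the two intervals would essentially coincide.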

These intervals can be extended to regions. A $100(1 - \alpha)\%$ highest posterior
density region $R(\theta)$ is a region such that

$$1 - \alpha = \Pr[\theta \in R(\theta)|y] = \int_{R(\theta)} p(\theta|y)\, d\theta. \tag{13.16}$$

An attraction of the Bayesian approach is that a posterior interval is much simpler to
interpret than a confidence interval in frequentist analysis. If a 95% posterior interval
for $\theta_k$ is (1, 4), then $\theta_k$ lies between 1 and 4 with posterior probability 0.95. In contrast,
for a frequentist 95% confidence interval for $\theta_k$ equal to (1, 4) we can only say that if
it were possible to repeat the analysis with many different samples yielding many
different confidence intervals, then 95% of these confidence intervals will include the
true value of $\theta_k$.


Hypothesis Testing 


Hypothesis testing receives little attention in the Bayesian context. As noted in the
discussion of point estimation, interest does not lie in determining the true parameter
value $\theta_0$. Instead, interest lies in the distribution of the range of values that $\theta$ might
take given the data and a prior. For model comparison see Section 13.8.


Conditional Posterior Density 


The conditional posterior density of $\theta_k$, given $\theta_j$, can be obtained from the joint and
marginal posterior densities as

$$p(\theta_k|\theta_j, \theta_j \in \Theta, y) = \frac{p(\theta_k, \theta_j|y)}{p(\theta_j|y)}. \tag{13.17}$$

Of special interest and significance is the set of $q$ conditional distributions $p(\theta_k|\theta_{-k})$,
$k = 1, \ldots, q$, also known as the set of full conditional distributions. These play an
important role in the modern computational techniques for obtaining the joint posterior
distribution presented in later sections.

The definitions of marginal and conditional posteriors in (13.14) and (13.17) can be
extended from individual parameters to blocks of parameters.


Marginal Likelihood 


The marginal probability or marginal likelihood is the denominator in Bayes’ rule 
and is defined as 


$$f(y) = \int L(y|\theta)\, \pi(\theta)\, d\theta. \tag{13.18}$$


It is the expected value of the likelihood, $E[L(y|\theta)]$, where the expectation is with
respect to the prior density. The marginal likelihood constitutes a basis for Bayesian 
inference (see Section 13.8), as it contains information about the support in the data 
for the prior. 
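
Because the marginal likelihood is an expectation with respect to the prior, it can in principle be approximated by averaging the likelihood over prior draws. The following Python sketch illustrates this for an assumed example that is not taken from the text: a sequence of Bernoulli outcomes with 3 successes in 10 trials and a Beta(1, 1) prior, for which the marginal likelihood has the closed form $B(a + k, b + n - k)/B(a, b)$. The function names are hypothetical.

```python
import math
import random


def marginal_likelihood_mc(k, n, a, b, n_draws, rng):
    """Average the Bernoulli likelihood over draws of theta from the Beta(a, b) prior."""
    total = 0.0
    for _ in range(n_draws):
        theta = rng.betavariate(a, b)             # theta drawn from the prior
        total += theta ** k * (1 - theta) ** (n - k)
    return total / n_draws


def marginal_likelihood_exact(k, n, a, b):
    """Closed form B(a + k, b + n - k) / B(a, b) for the same model."""
    def log_beta(p, q):
        return math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q)
    return math.exp(log_beta(a + k, b + n - k) - log_beta(a, b))


rng = random.Random(2)
ml_mc = marginal_likelihood_mc(k=3, n=10, a=1.0, b=1.0, n_draws=100000, rng=rng)
ml_exact = marginal_likelihood_exact(k=3, n=10, a=1.0, b=1.0)
print(ml_mc, ml_exact)
```

Averaging over prior draws is simple but can be very inefficient when the prior is diffuse relative to the likelihood, since few draws then fall where the likelihood is appreciable.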


Posterior Predictive Density 


Consider out-of-sample prediction of a single observation $y^p$. This has density
$f(y^p|\theta)$, where $\theta$ is unknown. The posterior predictive density of $y^p$ weights this
density by the posterior probability distribution of $\theta$, yielding


$$f(y^p|y) = \int f(y^p|\theta)\, p(\theta|y)\, d\theta. \tag{13.19}$$


If covariates appear in the likelihood function as in a regression model, then these 
densities are conditioned on them also. 
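
When draws from the posterior are available, the integral in the posterior predictive density can be simulated by composition: draw $\theta$ from the posterior, then draw $y^p$ from $f(y^p|\theta)$. The Python sketch below assumes, purely for illustration, a normal model with known variance $\sigma^2 = 1$ and a normal posterior for the mean with mean 2 and variance 0.25; these numbers and the function name are hypothetical. In this special case the predictive density is known to be normal with variance equal to the sum of the two variances, which provides a check.

```python
import random
import statistics


def posterior_predictive_draws(post_mean, post_var, sigma2, n_draws, rng):
    """Simulate y_p by composition: theta ~ p(theta | y), then y_p ~ f(y_p | theta)."""
    draws = []
    for _ in range(n_draws):
        theta = rng.gauss(post_mean, post_var ** 0.5)   # draw from the posterior
        draws.append(rng.gauss(theta, sigma2 ** 0.5))   # draw from the dgp given theta
    return draws


rng = random.Random(1)
yp = posterior_predictive_draws(post_mean=2.0, post_var=0.25, sigma2=1.0,
                                n_draws=50000, rng=rng)
pred_mean = statistics.fmean(yp)
pred_var = statistics.variance(yp)
print(pred_mean, pred_var)   # should be close to 2.0 and 0.25 + 1.0
```

The predictive variance exceeds $\sigma^2$ because it reflects parameter uncertainty as well as sampling variability.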


13.2.6. Large-Sample Behavior of the Posterior 


The influence of even informative priors on the posterior diminishes as the sample 
becomes large, as illustrated in the Section 13.2.2 example. This is the basis of the 
statement that asymptotically the likelihood dominates the inference or that the weight 
assigned to the prior essentially goes to zero as the sample size grows. 

Because the posterior distribution can be awkward to manipulate, an asymptotic 
approximation to the posterior is of interest as it can be used in place of the true finite- 
sample posterior distribution. This approximation is easy to obtain since asymptoti- 
cally the posterior equals the likelihood. We follow Gelman et al. (1995), to which the 
reader is referred for additional detail. 

For simplicity assume that observations are iid. Then the log-posterior is, up to an
additive constant that does not depend on $\theta$,

$$\ln p(\theta|y) = \ln \pi(\theta) + \sum_{i=1}^{N} \ln f(y_i|\theta). \tag{13.20}$$



This representation makes it clear that in a large sample the posterior is dominated by 
the likelihood contribution, since the contribution of the prior to the posterior remains 
fixed whereas the contribution of the sample to the posterior grows with N. 

Assume that the posterior $p(\theta|y)$ is unimodal and approximately symmetric. We
consider the asymptotic properties of the posterior mode, denoted by $\hat{\theta}$, which is then
the local and global maximum of the posterior.

To establish consistency of $\hat{\theta}$, we note that the posterior mode converges to the
MLE as $N \to \infty$, since the second term in (13.20) dominates. The posterior mode is
therefore consistent if the MLE is consistent. So $\hat{\theta} \stackrel{p}{\to} \theta_0$ if the dgp for $y$ has density
$f(y|\theta_0)$ and the usual regularity conditions for ML estimation are satisfied.

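To obtain the asymptotic distribution of $\theta$, consider a second-order Taylor series
expansion of the log posterior density around the posterior mode $\hat{\theta}$. Then

$$\ln p(\theta|y) = \ln p(\hat{\theta}|y) + \frac{1}{2} (\theta - \hat{\theta})' \left[ \left. \frac{\partial^2 \ln p(\theta|y)}{\partial\theta\, \partial\theta'} \right|_{\theta = \hat{\theta}} \right] (\theta - \hat{\theta}), \tag{13.21}$$

where simplification occurs because $\partial \ln p(\theta|y)/\partial\theta = 0$ when evaluated at the
posterior mode, and we assume that third- and higher-order terms in $\theta$ can be ignored
asymptotically. Define

$$I(\hat{\theta}) = - \left. \frac{\partial^2 \ln p(\theta|y)}{\partial\theta\, \partial\theta'} \right|_{\theta = \hat{\theta}}$$

to be the observed information based on the log posterior density $\ln p(\theta|y)$, evaluated at
the posterior mode. Then exponentiating (13.21) yields

$$p(\theta|y) \propto \exp\left( -\frac{1}{2} (\theta - \hat{\theta})' I(\hat{\theta}) (\theta - \hat{\theta}) \right),$$

which is the kernel of a multivariate normal distribution with mean $\hat{\theta}$ and variance
matrix $I(\hat{\theta})^{-1}$. It follows that a posteriori

$$\theta|y \stackrel{a}{\sim} \mathcal{N}[\hat{\theta}, I(\hat{\theta})^{-1}]. \tag{13.22}$$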


As the sample size $N$ grows large, the likelihood component of the posterior
becomes dominant and the influence of the prior becomes negligible. In this case we
may replace the mode $\hat{\theta}$ by the MLE, which is the mode of the likelihood. This
yields a result that is sometimes called a Bayesian central limit theorem (Gamerman,
1997). Asymptotically, frequentist and Bayesian inferences will be based on the same
limiting multivariate normal distribution, and hence there should be no significant
inconsistency between them.
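
The quality of the normal approximation can be checked in a case where the exact posterior is known. The Python sketch below is an illustration only, assuming Bernoulli data with a Beta(2, 2) prior, so that the posterior is Beta$(A, B)$ with $A = 2 + k$ and $B = 2 + N - k$; the example and the function name are assumptions rather than part of the text. It compares the approximate posterior standard deviation $I(\hat{\theta})^{-1/2}$, computed at the posterior mode, with the exact posterior standard deviation for a small and a large sample.

```python
import math


def normal_approx_vs_exact(k, n, a=2.0, b=2.0):
    """Bernoulli data (k successes in n trials) with a Beta(a, b) prior.

    Returns (posterior mode, sd implied by the observed information at the
    mode, exact posterior sd).
    """
    A, B = a + k, b + n - k                      # posterior is Beta(A, B)
    mode = (A - 1) / (A + B - 2)
    # Minus the second derivative of (A-1)ln(theta) + (B-1)ln(1-theta) at the mode.
    info = (A - 1) / mode ** 2 + (B - 1) / (1 - mode) ** 2
    approx_sd = info ** -0.5
    exact_sd = math.sqrt(A * B / ((A + B) ** 2 * (A + B + 1)))
    return mode, approx_sd, exact_sd


small = normal_approx_vs_exact(k=6, n=20)
large = normal_approx_vs_exact(k=3000, n=10000)
ratio_small = small[1] / small[2]
ratio_large = large[1] / large[2]
print(ratio_small, ratio_large)   # the ratio moves toward 1 as N grows
```

In this example the approximation overstates the posterior standard deviation by about 5% when $N = 20$ and is essentially exact when $N = 10{,}000$.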

This result has been labeled as the Bernstein-von Mises Theorem in the literature;
see Train (2003, chapter 12) for an accessible discussion of the three components 
of this theorem. These components comprise (1) the result that the posterior mean 
converges in probability to the maximum likelihood estimator, (2) that it has a limiting 
normal distribution, and (3) that the limiting distribution of the posterior mean is the 
same as that of the maximum likelihood estimator. These results are all implicit in 
the Bayesian central limit theorem. That theorem is of great interest and relevance to 
those who wish to apply the likelihood principles of estimation and inference. The full 
force of its implications will become apparent after we examine numerical methods 
for approximating the posterior distribution. 

Do the preceding arguments imply that Bayesian and likelihood-based methods will
produce essentially similar results? Is the choice between the two approaches
largely a matter of computational efficiency? A definitive treatment of these issues
is not available. However, there are a number of examples in the literature that show
not only that the two approaches may produce similar results, but also that Bayesian
methods are frequently computationally more efficient.


13.2.7. Bayesian Decision Analysis 


Given the full posterior distribution $p(\theta|y)$, which point estimate of $\theta$ should be
reported? This question was studied in Section 4.2 for best prediction of $y$ using, for
example, squared error loss. Here instead we consider best estimation of $\theta$ using, for
example, quadratic loss.

Let $L(\theta, \hat{\theta})$ denote the specified loss function, where $\hat{\theta}$ is an estimate of the unknown
$\theta$. The loss is unknown, as it depends on $\theta$, which is unknown. We can, however, find
the expected value over $\theta$ of the loss since Bayesian analysis, unlike classical analysis,
provides the distribution of $\theta$. The optimal estimator $\hat{\theta}_{OPT}$ is the estimator $\hat{\theta}$ that
minimizes expected posterior loss, or

$$\min_{\hat{\theta}} E[L(\theta, \hat{\theta})] = \min_{\hat{\theta}} \int L(\theta, \hat{\theta})\, p(\theta|y)\, d\theta. \tag{13.23}$$

Losses associated with different $(\theta, \hat{\theta})$ are weighted by the posterior probability
$p(\theta|y)$.

It can be shown that the posterior mean is the optimal estimator under quadratic loss,
$L(\theta, \hat{\theta}) = (\theta - \hat{\theta})'(\theta - \hat{\theta})$. If instead absolute error loss is used, with $L(\theta, \hat{\theta}) = |\theta - \hat{\theta}|$,
then the posterior median is the optimal estimator. Once the posterior distribution
has been established these point estimates can be computed either analytically or
numerically.
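
These two optimality results can be verified numerically. The Python sketch below is a rough illustration under assumptions that are not in the text: a scalar parameter whose posterior is represented by 2,001 draws from a skewed (exponential) distribution, and a grid search over candidate point estimates. The function name `expected_loss` is hypothetical. Expected posterior loss is approximated by the average loss over the draws.

```python
import random
import statistics


def expected_loss(draws, estimate, loss):
    """Average loss over posterior draws, approximating expected posterior loss."""
    return sum(loss(theta, estimate) for theta in draws) / len(draws)


rng = random.Random(4)
# Assumed skewed posterior, represented by draws (mean 1, median ln 2).
draws = [rng.expovariate(1.0) for _ in range(2001)]
grid = [i * 0.01 for i in range(301)]           # candidate estimates 0.00, ..., 3.00

best_quadratic = min(
    grid, key=lambda c: expected_loss(draws, c, lambda t, e: (t - e) ** 2))
best_absolute = min(
    grid, key=lambda c: expected_loss(draws, c, lambda t, e: abs(t - e)))

post_mean = statistics.fmean(draws)
post_median = statistics.median(draws)
print(best_quadratic, post_mean)     # minimizer under quadratic loss vs. mean
print(best_absolute, post_median)    # minimizer under absolute loss vs. median
```

Up to the resolution of the grid, the minimizer under quadratic loss coincides with the mean of the draws and the minimizer under absolute error loss with their median.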

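Under some conditions minimizing expected posterior loss can be shown to be
equivalent to minimizing expected posterior risk. The risk function averages the
possible loss over hypothetical samples of $y$ from the population, so

$$R(\theta, \hat{\theta}) = \int L(\theta, \hat{\theta})\, f(y|\theta)\, dy.$$

To avoid the possible confusion between loss function and likelihood function, here
and in the next equation block, we have used $f(y|\theta)$ as equivalent to the likelihood
$L(y|\theta)$. Expected posterior risk averages this risk over different values of the
parameters $\theta \in \Theta$ by weighting with respect to the posterior density, so

$$E[R(\theta, \hat{\theta})] = \int \left[ \int L(\theta, \hat{\theta})\, f(y|\theta)\, dy \right] p(\theta|y)\, d\theta \tag{13.24}$$

$$\phantom{E[R(\theta, \hat{\theta})]} = \int \left[ \int L(\theta, \hat{\theta})\, p(\theta|y)\, d\theta \right] f(y|\theta)\, dy$$

$$\phantom{E[R(\theta, \hat{\theta})]} = \int E[L(\theta, \hat{\theta})]\, f(y|\theta)\, dy,$$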


where in the first equality the outer integral ranges over the domain of $\theta$, in the second
equality the order of integration is interchanged, and in the third line the conclusion
follows. These operations presume that appropriate restrictions on $L(\theta, \hat{\theta})$ and $p(\theta|y)$
are satisfied. For example, $p(\theta|y)$ must be a proper density function and the loss
function must be integrable. Hence expected risk will remain bounded and minimizing it
is a well-defined operation.

The foregoing argument establishes a well-known and important result that the 
Bayes estimator is admissible in the sense that it minimizes expected risk for a speci- 
fied loss function. 


13.3. Bayesian Analysis of Linear Regression 


Because the analysis of linear regression is a familiar topic, it provides a useful por- 
tal to more general nonlinear models. The data are assumed to be generated by the 
standard linear regression model 


$$y = X\beta + u,$$


where $X$ denotes the $N \times K$ full column rank matrix of weakly exogenous
regressors. The errors are assumed to be independent, homoskedastic, and
normally distributed, with $u \sim \mathcal{N}[0, \sigma^2 I_N]$. The sample conditional density is therefore
$y|X, \beta, \sigma^2 \sim \mathcal{N}[X\beta, \sigma^2 I_N]$. Our exposition follows Zellner (1971).

We deal in turn with noninformative and informative priors. In both cases a
closed-form expression for the posterior can be obtained after some considerable algebra. For
the noninformative prior it will be seen that the OLS estimator has a Bayesian
interpretation as the mean of the posterior distribution. In the informative prior case it will be
seen that the posterior moments are weighted functions of the sample and prior means.

Subsequent sections present methods for less tractable models, but even then anal- 
ysis is simplified if results similar to those given in this section can be applied to some 
subcomponents of the model. 


13.3.1. Noninformative Priors 


For noninformative priors we use Jeffreys’ priors. From Section 13.2.4, for $y \sim
\mathcal{N}[\mu, \sigma^2]$ this prior for $\mu$ (given $\sigma^2$ known) is a constant, whereas the prior for $\sigma^2$
(given $\mu$ known) is proportional to $\sigma^{-2}$. For the regression case this extends to a constant
prior for $\beta_j$, $j = 1, \ldots, K$, so $\pi(\beta_j) \propto c$, and the prior for $\sigma^2$ is $\pi(\sigma^2) \propto 1/\sigma^2$.
The prior views all values of $\beta_j$ as equally likely, whereas smaller values of $\sigma^2$ are
viewed as being more likely. Assuming independence of $\beta$ and $\sigma^2$ the joint prior is


$$\pi(\beta, \sigma^2) \propto 1/\sigma^2.$$
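
The likelihood function can be reexpressed as

$$L(\beta, \sigma^2|y, X) = (2\pi\sigma^2)^{-N/2} \exp\left\{ -\frac{1}{2\sigma^2} (y - X\beta)'(y - X\beta) \right\} \tag{13.25}$$

$$\propto (\sigma^2)^{-N/2} \exp\left( -\frac{1}{2\sigma^2} \left[ \hat{u}'\hat{u} + (\beta - \hat{\beta})'X'X(\beta - \hat{\beta}) \right] \right)$$

$$\propto (\sigma^2)^{-N/2} \exp\left( -\frac{1}{2\sigma^2} \left[ (N - K)s^2 + (\beta - \hat{\beta})'X'X(\beta - \hat{\beta}) \right] \right),$$

where $\hat{\beta} = (X'X)^{-1}X'y$ and $\hat{u} = y - X\hat{\beta}$; the second line uses $y - X\beta = \hat{u} -
X(\beta - \hat{\beta})$ and $X'\hat{u} = 0$; and the third line uses $s^2 = \hat{u}'\hat{u}/(N - K)$.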
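Combining the likelihood in (13.25) and the prior, we obtain the posterior density

$$p(\beta, \sigma^2|y, X) \propto \left( \frac{1}{\sigma^2} \right)^{N/2} \exp\left( -\frac{1}{2\sigma^2} \left[ (N - K)s^2 + (\beta - \hat{\beta})'X'X(\beta - \hat{\beta}) \right] \right) \times \frac{1}{\sigma^2} \tag{13.26}$$

$$\propto \left( \frac{1}{\sigma^2} \right)^{N/2 + 1} \exp\left( -\frac{1}{2\sigma^2} \left\{ (N - K)s^2 + (\beta - \hat{\beta})'X'X(\beta - \hat{\beta}) \right\} \right)$$

$$\propto \left\{ \left( \frac{1}{\sigma^2} \right)^{K/2} \exp\left( -\frac{1}{2} (\beta - \hat{\beta})' \left( \sigma^2 (X'X)^{-1} \right)^{-1} (\beta - \hat{\beta}) \right) \right\}$$

$$\quad \times \left( \frac{1}{\sigma^2} \right)^{(N - K)/2 + 1} \exp\left( -\frac{(N - K)s^2}{2\sigma^2} \right).$$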

The conditional posterior distribution $p(\beta|\sigma^2, y, X)$ of $\beta$, given $\sigma^2$ and the data
$y, X$, is clearly the $K$-dimensional multivariate normal with mean $\hat{\beta}$ and variance
$\sigma^2 (X'X)^{-1}$, since $\beta$ appears only in the first line of the final expression. The
conditional posterior of $\sigma^2$ given $\beta$ is more difficult to obtain as $\sigma^2$ appears in both lines.
The marginal posterior of $\beta$, obtained by integrating out $\sigma^2$, is much more
useful for posterior inference about $\beta$. We integrate the second line of (13.26), change
variables to $z = 1/\sigma^2$ and use the result that $\int_0^\infty z^c \exp(-az)\, dz = \Gamma(c + 1)/a^{c+1}$ for
given constants $a > 0$, $c > -1$, where here $c = N/2 - 1$, since the Jacobian of the change of
variables contributes a factor $z^{-2}$, and $a$ is one-half of the lengthy
term in braces. This yields the kernel of the marginal posterior distribution

$$p(\beta|y, X) \propto \left\{ (N - K)s^2 + (\beta - \hat{\beta})'X'X(\beta - \hat{\beta}) \right\}^{-N/2} \tag{13.27}$$

$$\propto \left\{ 1 + (\beta - \hat{\beta})' \left( s^2 (N - K)(X'X)^{-1} \right)^{-1} (\beta - \hat{\beta}) \right\}^{-(N - K + K)/2},$$

which from Section 13.3.5 is the kernel of a multivariate Student t-distribution
centered at $\hat{\beta}$ with $N - K$ degrees of freedom and covariance matrix $s^2 (X'X)^{-1}$
multiplied by $(N - K)/(N - K - 2)$. Thus

$$\beta \sim t_{N-K}(\hat{\beta}, s^2 (X'X)^{-1}). \tag{13.28}$$


An individual element of $\beta$ has a univariate Student t-distribution.
The marginal posterior for $\sigma^2$ is more easily obtained, by integrating the final
expression in (13.26) with respect to $\beta$ and noting that $\beta$ appears in only the first line,
which is the kernel of the $\mathcal{N}[\hat{\beta}, \sigma^2 (X'X)^{-1}]$ density and integrates to one. It follows
that the marginal posterior for $\sigma^2$ is

$$p(\sigma^2|y, X) \propto (\sigma^2)^{-[(N - K)/2 + 1]} \exp\left( -\frac{(N - K)s^2}{2\sigma^2} \right). \tag{13.29}$$


This expression is known to be the kernel of an inverted square-root gamma density. 
That is, it is the density of a random variable that is the reciprocal of the square-root 
of a gamma-distributed random variable with degrees-of-freedom parameter N — K. 
This result is identical to that obtained under the frequentist analysis of the distribution 
of B. 

For normal linear regression, Bayesian analysis with noninformative priors
therefore yields qualitatively similar conclusions to those from the standard frequentist
analysis in finite samples. Conditional on $\sigma^2$ the posterior of $\beta$ is the $\mathcal{N}[\hat{\beta}, \sigma^2 (X'X)^{-1}]$
distribution, and unconditionally the posterior of $\beta$ is the multivariate t-distribution.

The interpretation is quite different, however, as these distributions are of the
unknown parameter $\beta$ with mean $\hat{\beta}$, rather than of an estimate $\hat{\beta}$ with unknown mean $\beta$.
For example, the Bayesian 95% HPD interval for $\beta_j$ is $\hat{\beta}_j \pm t_{.025, N-K} \times \mathrm{se}[\hat{\beta}_j]$, where
$\mathrm{se}[\hat{\beta}_j] = (s^2 [(X'X)^{-1}]_{jj})^{1/2}$. From Section 13.2.5 the interpretation is that $\beta_j$ lies in this
interval with posterior probability 0.95.
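
The structure of the posterior under the noninformative prior suggests a simple way to simulate it: draw $\sigma^2$ from its marginal posterior and then draw $\beta$ from its normal conditional posterior. The Python sketch below does this for the special case $K = 1$ (a single regressor and no intercept), using simulated data; the data, the seed, and the function name are assumptions made for illustration. It uses the fact that $(N - K)s^2/\sigma^2$ is chi-square distributed with $N - K$ degrees of freedom a posteriori, and it checks that the simulated marginal posterior of $\beta$ is centered at the OLS estimate with variance $s^2 (X'X)^{-1} (N - K)/(N - K - 2)$.

```python
import random
import statistics


def posterior_draws_beta(x, y, n_draws, rng):
    """Draws of beta for y_i = beta * x_i + u_i under the prior 1/sigma^2 (K = 1).

    Composition sampling: sigma^2 from its marginal posterior, then
    beta | sigma^2 from N[beta_hat, sigma^2 / sum(x_i^2)].
    """
    n, k = len(y), 1
    sxx = sum(xi * xi for xi in x)                                  # X'X
    beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / sxx           # OLS estimate
    s2 = sum((yi - beta_hat * xi) ** 2 for xi, yi in zip(x, y)) / (n - k)
    nu = n - k
    draws = []
    for _ in range(n_draws):
        chi2 = rng.gammavariate(nu / 2.0, 2.0)                      # chi-square(nu) draw
        sigma2 = nu * s2 / chi2                                     # sigma^2 | y, X
        draws.append(rng.gauss(beta_hat, (sigma2 / sxx) ** 0.5))    # beta | sigma^2, y, X
    return beta_hat, s2, sxx, nu, draws


rng = random.Random(7)
x = [rng.gauss(0, 1) for _ in range(12)]              # hypothetical regressor
y = [0.8 * xi + rng.gauss(0, 1) for xi in x]          # hypothetical outcome
beta_hat, s2, sxx, nu, draws = posterior_draws_beta(x, y, 100000, rng)

target_var = (s2 / sxx) * nu / (nu - 2)               # variance of the t posterior
print(statistics.fmean(draws), beta_hat)
print(statistics.variance(draws), target_var)
```

With $N - K = 11$ the posterior variance of $\beta$ exceeds $s^2 (X'X)^{-1}$ by the factor $11/9$, reflecting the fatter tails of the t-distribution.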


13.3.2. Informative Priors 


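Bayesian analysis of the normal linear regression model under informative priors is
especially insightful if we use independent conjugate priors for $\beta$ and $\sigma^2$. From
Section 13.2.4, the conjugate prior for $\beta$ is the normal, and the conjugate prior for $1/\sigma^2$ is
the gamma. This leads to the normal-gamma prior

$$\pi(\beta, 1/\sigma^2) = \pi_N(\beta|1/\sigma^2)\, \pi_G(1/\sigma^2),$$

where $\pi_N(\beta|1/\sigma^2)$ is the $\mathcal{N}[\beta_0, \sigma^2 \Omega_0^{-1}]$ density, with $\beta_0$ and $\Omega_0$ known, and the
kernel is

$$\pi_N(\beta|1/\sigma^2) \propto (\sigma^2)^{-K/2} \exp\left[ -\frac{1}{2\sigma^2} (\beta - \beta_0)' \Omega_0 (\beta - \beta_0) \right] \tag{13.30}$$

and $\pi_G(1/\sigma^2)$ is the $\mathcal{G}[\nu_0, s_0^2]$ density where $\nu_0$ and $s_0^2$ are known constants, and

$$\pi_G(1/\sigma^2) \propto (\sigma^2)^{-(\nu_0/2) - 1} \exp\left( -\frac{\nu_0 s_0^2}{2\sigma^2} \right). \tag{13.31}$$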
Oo 


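Note that the prior for the (location) parameter $\beta$ depends on the (scale) parameter
$\sigma^2$. This makes sense as $\sigma^2$ reflects the scale on which $y$ is measured and hence should
affect $\beta$. Given this prior and the likelihood in (13.25), the posterior density is of a
normal-gamma type. After some algebra it is as follows:

$$p(\beta, 1/\sigma^2|y, X) \propto (\sigma^2)^{-K/2} \exp\left[ -\frac{1}{2\sigma^2} (\beta - \beta_0)' \Omega_0 (\beta - \beta_0) \right]$$

$$\quad \times (\sigma^2)^{-(\nu_0/2) - 1} \exp\left( -\frac{\nu_0 s_0^2}{2\sigma^2} \right)$$

$$\quad \times (\sigma^2)^{-N/2} \exp\left[ -\frac{1}{2\sigma^2} (y - X\beta)'(y - X\beta) \right]$$

$$\propto (\sigma^2)^{-K/2} \exp\left[ -\frac{1}{2\sigma^2} (\beta - \bar{\beta})' \bar{\Omega} (\beta - \bar{\beta}) \right] \tag{13.32}$$

$$\quad \times (\sigma^2)^{-(\bar{\nu}/2) - 1} \exp\left( -\frac{\bar{\nu} \bar{s}^2}{2\sigma^2} \right),$$

where $\bar{\beta}$ and $\bar{\Omega}^{-1}$ denote the posterior mean and variance of $\beta$ and $\bar{s}^2$ denotes the
posterior mean of $\sigma^2$ defined as

$$\bar{\beta} = (\Omega_0 + X'X)^{-1} (\Omega_0 \beta_0 + X'X \hat{\beta}), \tag{13.33}$$

$$\bar{\Omega} = (\Omega_0 + X'X),$$

$$\bar{\nu} \bar{s}^2 = \nu_0 s_0^2 + \hat{u}'\hat{u} + (\hat{\beta} - \beta_0)' \left[ \Omega_0^{-1} + (X'X)^{-1} \right]^{-1} (\hat{\beta} - \beta_0), \quad \bar{\nu} = \nu_0 + N.$$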


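The posterior mean $\bar{\beta}$ is obtained by using the matrix version of the “completing the
square” operation. Specifically, given the $K \times 1$ vectors $\beta$, $\hat{\beta}$, $\beta_0$, and $\bar{\beta}$, and $K \times K$
symmetric square matrices $A$ and $B$, it can be shown that

$$(\beta - \beta_0)' A (\beta - \beta_0) + (\beta - \hat{\beta})' B (\beta - \hat{\beta})$$

$$\quad = (\beta - \bar{\beta})' (A + B) (\beta - \bar{\beta}) + (\hat{\beta} - \beta_0)' A (A + B)^{-1} B (\hat{\beta} - \beta_0),$$

where $\bar{\beta} = (A + B)^{-1} (A\beta_0 + B\hat{\beta})$.

The joint posterior of $\beta$ and $\sigma^2$ is of the same normal-gamma form as the
prior.

The conditional posterior of $\beta$ given $\sigma^2$ has mean $\bar{\beta}$, a matrix-weighted average of
the prior mean $\beta_0$ and the sample mean $\hat{\beta}$.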

In general using a conjugate prior is algebraically equivalent to augmenting the
data with a sample from the same distribution. In this case the normal-gamma prior
is equivalent to an additional sample of the same process with regression parameter
estimate of $\beta_0$, $X'X$ matrix equal to $\Omega_0$, degrees-of-freedom parameter equal to $\nu_0$, and
error sum of squares equal to $\nu_0 s_0^2$. Since $\Omega_0$ is a fixed matrix, $\Omega_0/N \to 0$ as $N \to \infty$,
whereas $X'X/N$ converges to a matrix of constants. Hence $\bar{\beta} \to \hat{\beta}$, verifying that in
large samples the ML estimator and the posterior mean are equivalent. The posterior
variance $\bar{\Omega}^{-1}$ is proportional to $(\Omega_0 + X'X)^{-1}$. See Leamer (1978) for a more detailed
exposition.
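
The weighting of prior and sample information, and the way the prior's weight vanishes as $N$ grows, can be seen in a small computation. The Python sketch below treats the scalar case $K = 1$ for brevity, so that the matrix-weighted average reduces to a precision-weighted average; the simulated data, the prior values $\beta_0 = 0$ and $\Omega_0 = 5$, and the function name are all assumptions made for illustration.

```python
import random


def posterior_mean_scalar(x, y, beta0, omega0):
    """Posterior mean of beta for y_i = beta * x_i + u_i (K = 1) under the
    conjugate prior beta | sigma^2 ~ N[beta0, sigma^2 / omega0]."""
    sxx = sum(xi * xi for xi in x)                  # X'X
    sxy = sum(xi * yi for xi, yi in zip(x, y))      # X'y
    beta_ols = sxy / sxx
    # (Omega_0 + X'X)^{-1} (Omega_0 beta_0 + X'X beta_hat) in the scalar case.
    beta_bar = (omega0 * beta0 + sxx * beta_ols) / (omega0 + sxx)
    weight_on_prior = omega0 / (omega0 + sxx)
    return beta_ols, beta_bar, weight_on_prior


rng = random.Random(3)
results = {}
for n in (10, 1000):
    x = [rng.gauss(0, 1) for _ in range(n)]         # hypothetical regressor
    y = [1.5 * xi + rng.gauss(0, 1) for xi in x]    # hypothetical outcome
    results[n] = posterior_mean_scalar(x, y, beta0=0.0, omega0=5.0)
    print(n, results[n])
```

The posterior mean always lies between the prior mean and the OLS estimate, and with $N = 1{,}000$ the weight on the prior mean is under 1%.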

The marginal posterior of $\beta$ is obtained by integrating $\sigma^2$ out of the joint posterior.
With $\bar{\nu} = \nu_0 + N$, this yields

$$p(\beta|y, X) \propto \left[ \bar{\nu} \bar{s}^2 + (\beta - \bar{\beta})' (\Omega_0 + X'X) (\beta - \bar{\beta}) \right]^{-(\bar{\nu} + K)/2}; \tag{13.34}$$

hence the marginal posterior is a multivariate Student t-distribution, one that is centered
around $\bar{\beta}$ rather than around $\hat{\beta}$ as in the case of an uninformative prior.

Because the conjugate prior treats the prior information like a previous sample from 
the same process, the sample and prior information are handled symmetrically even 
though the information from the two sources may be in conflict. Thus the mathemat- 
ical convenience of using conjugate priors comes at a price. If the prior information 
and the sample information are apparently in conflict, the posterior distribution can be 
expected to be bimodal with the modes corresponding to sample and prior means. A 
prior distribution that allows one to capture such a feature is a prior that specifies that $\beta$
has a multivariate Student t-density independent of $1/\sigma^2$ and $1/\sigma^2$ has a gamma prior
distribution independent of $\beta$. This has been called “Dickey’s prior” (Leamer, 1978,
p. 79). Under this assumption the marginal posterior is a product of two multi- 
variate Student t-densities; this product can also be expressed as a mixture of two 
t-distributions. Such a distribution can potentially exhibit bimodality. Leamer (1978) 
has provided a more extensive analysis of this case. 


13.3.3. Mixed Estimation 


We seek to place Bayesian analysis of linear regression in a frequentist setting. 

Frequentist analysis usually incorporates prior information as equality constraints, 
which is a limiting case of Bayesian analysis where the variance parameters in the 
prior go to zero. Prior information that is instead stochastic can also be incorporated 
into frequentist analysis, by using mixed estimation. The algebra is simple, and the 
approach also provides an intuitive understanding of how Bayesian procedures pool 
prior and sample information. 

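We continue with the linear regression model under normality. Assume prior
information for the regression parameters that $\beta \sim \mathcal{N}[0, \sigma_v^2 I_K]$, where extension to nonzero
mean is relatively easy. The prior information can be written as

$$\beta = 0 + v,$$

where $v$ is a $K \times 1$ error with $v \sim \mathcal{N}[0, \sigma_v^2 I_K]$. Now augment the sample
information $y = X\beta + u$ by this prior, and write the full model as an augmented regression
model

$$\begin{bmatrix} y \\ 0 \end{bmatrix} = \begin{bmatrix} X \\ \frac{\sigma}{\sigma_v} I_K \end{bmatrix} \beta + \begin{bmatrix} u \\ -\frac{\sigma}{\sigma_v} v \end{bmatrix}. \tag{13.35}$$

This can be reparameterized as

$$\begin{bmatrix} y \\ 0 \end{bmatrix} = \begin{bmatrix} X \\ \lambda I_K \end{bmatrix} \beta + \begin{bmatrix} u \\ v^* \end{bmatrix},$$

where $\lambda = \sigma/\sigma_v$ and the transformation $v^* = -\lambda v$ has been used so that all errors have
common variance $\sigma^2$.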


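The estimator based on this augmented data set is a pooled estimator or a mixed
estimator. Conditional on $\lambda$, the mixed estimator is

$$\hat{\beta}_\lambda = [X'X + \lambda^2 I_K]^{-1} X'y \tag{13.36}$$

$$= [X'X (I_K + \lambda^2 (X'X)^{-1})]^{-1} X'y$$

$$= [I_K + \lambda^2 (X'X)^{-1}]^{-1} (X'X)^{-1} X'y$$

$$= A_\lambda \hat{\beta},$$

where $A_\lambda = [I_K + \lambda^2 (X'X)^{-1}]^{-1}$, and $\hat{\beta} = (X'X)^{-1} X'y$ is the unrestricted OLS
estimator.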

This estimator is the so-called ridge-regression estimator introduced without a 
Bayesian justification by Hoerl and Kennard (1970) to combat the problem of mul- 
ticollinearity in small samples. This estimator also belongs to a class of shrinkage 
estimators, in which the estimator is shrunk toward (or pulled toward) a prior mean, 
in this case the zero vector. This sometimes makes some sense in a finite sample with 
highly multicollinear data where the “t-ratios” tend to zero, making it difficult to dis- 
tinguish between variables whose coefficients are truly close to zero and those that 
only appear to be that way. In the limit shrinkage leads to variable exclusion. 

Several features of $\hat{\beta}_\lambda$ are noteworthy: (1) Conditional on $\lambda$, $\hat{\beta}_\lambda$ is the mean of a
posterior distribution of $\beta$. (2) The estimator is a matrix-weighted average of the zero vector
and $\hat{\beta}$. (3) The algebra changes very little if we choose to shrink the estimator toward
some nonzero $\beta$, say $\beta_0$. Then the resulting estimator is a matrix-weighted average
of the vectors $\beta_0$ and $\hat{\beta}$.

The symmetric weighting matrix $A_\lambda = [I_K + (\lambda^2/N)(N^{-1}X'X)^{-1}]^{-1} \to I_K$ as $N \to
\infty$, since $\lambda^2/N \to 0$. Therefore,

$$\hat{\beta}_\lambda \to \hat{\beta} \text{ as } N \to \infty,$$

so the effect of the prior on the posterior mean vanishes as the sample becomes large.
Similarly, the conditional posterior variance of $\hat{\beta}_\lambda$ is given by

$$V[\hat{\beta}_\lambda] = A_\lambda V[\hat{\beta}] A_\lambda'$$

$$= \sigma^2 A_\lambda (X'X)^{-1} A_\lambda',$$

so $V[\hat{\beta}_\lambda] \to \sigma^2 (X'X)^{-1}$ as the sample size $N \to \infty$.
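
The equivalence between the mixed estimator computed from the augmented data and the closed form in (13.36) can be checked directly. The Python sketch below is a minimal illustration with two nearly collinear regressors, simulated data, and $\lambda = 2$; all of these choices, and the helper names `normal_equations` and `solve_2x2`, are assumptions made for the example. It computes OLS on the augmented data set and compares it with $[X'X + \lambda^2 I_K]^{-1} X'y$.

```python
import random


def normal_equations(rows, y):
    """Return X'X (2 x 2 nested list) and X'y (length-2 list) for a two-column X."""
    a = b = d = g0 = g1 = 0.0
    for (x1, x2), yi in zip(rows, y):
        a += x1 * x1
        b += x1 * x2
        d += x2 * x2
        g0 += x1 * yi
        g1 += x2 * yi
    return [[a, b], [b, d]], [g0, g1]


def solve_2x2(m, v):
    """Solve the 2 x 2 linear system m b = v by Cramer's rule."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [(m[1][1] * v[0] - m[0][1] * v[1]) / det,
            (m[0][0] * v[1] - m[1][0] * v[0]) / det]


rng = random.Random(5)
rows = []
for _ in range(30):
    x1 = rng.gauss(0, 1)
    rows.append((x1, x1 + 0.1 * rng.gauss(0, 1)))   # nearly collinear regressors
y = [x1 + x2 + rng.gauss(0, 1) for x1, x2 in rows]
lam = 2.0                                           # lambda = sigma / sigma_v, treated as given

xtx, xty = normal_equations(rows, y)
beta_ols = solve_2x2(xtx, xty)

# Closed form: [X'X + lambda^2 I]^{-1} X'y.
ridge_direct = solve_2x2(
    [[xtx[0][0] + lam ** 2, xtx[0][1]], [xtx[1][0], xtx[1][1] + lam ** 2]], xty)

# OLS on the augmented data: K extra rows lambda * I_K with outcome 0.
aug_rows = rows + [(lam, 0.0), (0.0, lam)]
aug_y = y + [0.0, 0.0]
ridge_augmented = solve_2x2(*normal_equations(aug_rows, aug_y))

print(beta_ols, ridge_direct, ridge_augmented)
```

The two computations agree to rounding error, and the mixed estimate is shorter in Euclidean norm than the OLS estimate, which is the shrinkage toward the zero prior mean.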

For finite samples, conditional on $\lambda$ and $\sigma^2$, the conditional posterior distribution
of $\hat{\beta}_\lambda$ is

$$\hat{\beta}_\lambda|\lambda, \sigma^2 \sim \mathcal{N}[A_\lambda \hat{\beta}, \sigma^2 A_\lambda (X'X)^{-1} A_\lambda']. \tag{13.37}$$

The marginal posterior distribution of $\hat{\beta}_\lambda$ is obtained by integrating out $\lambda$ and $\sigma^2$.
Treating $\lambda$ as given, and assuming a vague or uninformative prior on $\sigma^2$, we can integrate
out $\sigma^2$ as was shown in Section 13.3.1. This integration operation is analytically
feasible and yields a marginal posterior of $\hat{\beta}_\lambda$ that is the multivariate Student t-distribution.
Finally, we can specify a prior distribution on $\lambda$, possibly a gamma prior since $\lambda > 0$,
and proceed to integrate it out. However, $\lambda$ enters the conditional posterior in an
awkward fashion and cannot be integrated out analytically. At this stage we would need
to resort to a numerical technique. Assuming that this is accomplished, we then have a
Bayesian treatment of this model.


13.3.4. Hierarchical Priors 


We consider a three-stage linear regression model that is hierarchical in regression 
parameters but not in variance parameters. 

The first stage is a linear regression model denoted $y = X_1 \beta_1 + u$, where the
subscript 1 is added to distinguish between first- and second-stage parameters and
regressors. The parameters $\beta_1$ are random and are modeled to depend on both parameters
and data, so $\beta_1 = X_2 \beta_2 + v$. For example, the first level models individual student
test performance and the second level brings in school characteristics. The errors are
assumed to be normally distributed. The second-level parameters $\beta_2$ are treated as
unknown and a prior is specified. A prior is also specified for the variance parameter $\sigma_1^2$
in the first-stage model.

Assuming normally distributed errors and using conjugate priors leads to the fol- 
lowing model: 


$$y|X_1, \beta_1, \sigma_1^2 \sim \mathcal{N}[X_1 \beta_1, \sigma_1^2 I_N], \tag{13.38}$$

$$\beta_1|X_2, \beta_2, \Sigma_2 \sim \mathcal{N}[X_2 \beta_2, \Sigma_2], \tag{13.39}$$

$$\beta_2 \sim \mathcal{N}[\beta^*, \Sigma^*], \tag{13.40}$$

$$\sigma_1^{-2} \sim \mathcal{G}[\nu^*/2, \nu^* \sigma^{*2}/2], \tag{13.41}$$

where $X_1$ is $N \times K$, $X_2$ is $K \times M$, $\beta_1$ is $K \times 1$, $\beta_2$ is $M \times 1$, $\Sigma_2$ is $K \times K$, $\beta^*$ is
$M \times 1$, and $\Sigma^*$ is $M \times M$. For the regression parameter $\beta_1$ the second line gives the
prior, and the third line gives the subsequent second-stage prior, or a prior on a prior,
for $\beta_2$ (while $X_2$ is assumed known). The parameters $(\beta^*, \Sigma^*)$ are often referred to as
hyperparameters. For variance parameters, the fourth line gives a prior for the variance
parameter $\sigma_1^2$ with $\nu^*$ and $\sigma^{*2}$ specified. The innovation is the addition of (13.40).
Note that we can collapse the stages and convert this into a two-level model.
Specifically, we can write a two-stage model with an informative prior in one of two
ways, either

$$y|X_1, \beta_1, \sigma_1^2 \sim \mathcal{N}[X_1 \beta_1, \sigma_1^2 I_N],$$

$$\beta_1|X_2, \Sigma_2 \sim \mathcal{N}[X_2 \beta^*, \Sigma_2 + X_2 \Sigma^* X_2']$$

or

$$y|X_1, X_2, \beta_2, \Sigma_2, \sigma_1^2 \sim \mathcal{N}[X_1 X_2 \beta_2, \sigma_1^2 I_N + X_1 \Sigma_2 X_1'],$$

$$\beta_2 \sim \mathcal{N}[\beta^*, \Sigma^*].$$

If $\sigma_1^2$ were given, this setup corresponds to conditionally conjugate normal priors.
Using results introduced earlier we can derive expressions for the posterior means of
either $\beta_1$ or $\beta_2$ as matrix-weighted averages of either $\beta^*$ and $\hat{\beta}_1$, or of $\beta^*$ and $\hat{\beta}_2$.

The use of the normal distribution is only illustrative. Hierarchical models for gen- 
eralized linear models, members of the linear exponential family, have been widely 
used (Albert, 1988). 

In hierarchical models it may not be possible to obtain the full posterior
probability distribution of first-stage parameters such as $\beta_1$ in an analytically tractable form.
Fortunately, the advances in computational methods presented in the next section are
especially well suited to models with a hierarchical structure.

Another approach, which is an application of the empirical Bayes method, involves
estimation of parameters in the higher stage priors, similar to that in the likelihood
approach. This approach avoids, for example, assuming that $\Sigma_2$ and $\Sigma^*$ are known
matrices.


13.3.5. Multivariate t- and Wishart Distributions 


Bayesian analysis makes use of a wider range of distributions than classical analysis. 
Here we present details on two multivariate distributions that are used in Bayesian 
analysis of linear regression under normality. 

The multivariate t-distribution is a multivariate extension of the univariate Student
t. It is similar to the multivariate normal, except that the tails of the distribution can be
considerably fatter. In Bayesian analysis it arises as the marginal posterior for $\beta$ given
a conjugate normal prior (see Section 13.3.2) or can be used directly as the prior for $\beta$
if tails fatter than the normal are desired. A $q \times 1$ random variable $t$ that is multivariate
Student-t distributed with degrees-of-freedom parameter $\nu$, mean parameters $\mu$, and
dispersion parameters $\Sigma$, has joint density

$$f(t|\nu, \mu, \Sigma) = \frac{\Gamma((\nu + q)/2)}{\Gamma(\nu/2)\, (\pi\nu)^{q/2}\, |\Sigma|^{1/2}} \times \left[ 1 + \frac{1}{\nu} (t - \mu)' \Sigma^{-1} (t - \mu) \right]^{-(\nu + q)/2},$$

where $\Gamma(\cdot)$ is the gamma function. This distribution is symmetric with mode $\mu$, mean $\mu$
if $\nu > 1$, and variance $[\nu/(\nu - 2)]\Sigma$ if $\nu > 2$. The tails can be much fatter than the
normal (e.g., the variance is $3\Sigma$ if $\nu = 3$) and the normal is obtained as $\nu \to \infty$. If
$z \sim \mathcal{N}[0, I]$ and $s \sim \chi^2(\nu)$ then $t = \mu + \Sigma^{1/2} z / \sqrt{s/\nu}$ has the multivariate
t-distribution given here, providing an easy way to obtain draws.
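
The construction of draws just described can be sketched in a few lines of Python. The example below assumes a bivariate case with $\nu = 10$ and an arbitrary dispersion matrix chosen for illustration; the function name `draw_mvt2` is hypothetical, and the Cholesky factor is used as the matrix square root $\Sigma^{1/2}$. The sample moments are compared with the mean $\mu$ and the variance $[\nu/(\nu - 2)]\Sigma$ stated above.

```python
import random
import statistics


def draw_mvt2(mu, sigma, nu, rng):
    """One draw from a bivariate Student t via t = mu + L z / sqrt(s / nu),
    where L is the Cholesky factor of the 2 x 2 dispersion matrix sigma."""
    l11 = sigma[0][0] ** 0.5
    l21 = sigma[1][0] / l11
    l22 = (sigma[1][1] - l21 ** 2) ** 0.5
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)      # z ~ N[0, I]
    s = rng.gammavariate(nu / 2.0, 2.0)            # s ~ chi-square(nu)
    scale = (s / nu) ** 0.5
    return (mu[0] + l11 * z1 / scale,
            mu[1] + (l21 * z1 + l22 * z2) / scale)


rng = random.Random(8)
mu = (1.0, -1.0)                                   # assumed location
sigma = [[2.0, 0.6], [0.6, 1.0]]                   # assumed dispersion matrix
nu = 10
draws = [draw_mvt2(mu, sigma, nu, rng) for _ in range(100000)]

first = [d[0] for d in draws]
second = [d[1] for d in draws]
print(statistics.fmean(first), statistics.fmean(second))          # near mu
print(statistics.variance(first), nu / (nu - 2) * sigma[0][0])    # near 2.5
print(statistics.variance(second), nu / (nu - 2) * sigma[1][1])   # near 1.25
```

The same chi-square draw $s$ scales both components, which is what makes the components dependent even when $\Sigma$ is diagonal.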

The Wishart distribution is a multivariate extension of the univariate chi-square
distribution, or more generally the gamma distribution. In Bayesian analysis it is used
as the conjugate prior for the inverse of the covariance matrix of a multivariate normal
distribution. A $q \times q$ random positive definite matrix $W$ that is Wishart distributed
with degrees of freedom parameter $\nu \geq q$ and scale matrix $S$ has joint density

$$f_W(W|\nu, S) = \left[ 2^{\nu q/2}\, \pi^{q(q-1)/4} \prod_{j=1}^{q} \Gamma\left( \frac{\nu + 1 - j}{2} \right) \right]^{-1} \times |S|^{-\nu/2}\, |W|^{(\nu - q - 1)/2} \exp\left( -\mathrm{tr}(S^{-1}W)/2 \right),$$

where $\Gamma(\cdot)$ is the gamma function and $\mathrm{tr}(\cdot)$ denotes the trace of a matrix. This
distribution has mean $\nu S$. The sample covariance matrix for iid multivariate normal
data is Wishart distributed. More generally, given $\nu$ ($\geq q$) independent $q \times 1$ vectors
$x_j \sim \mathcal{N}[0, S]$, $j = 1, \ldots, \nu$, then $\sum_{j=1}^{\nu} x_j x_j'$ is Wishart distributed. If $W^{-1}$ is Wishart
distributed with density $f_W(W^{-1}|\nu, S)$ then $W$ is inverse-Wishart distributed with
density

$$f_{IW}(W|\nu, S) = \left[ 2^{\nu q/2}\, \pi^{q(q-1)/4} \prod_{j=1}^{q} \Gamma\left( \frac{\nu + 1 - j}{2} \right) \right]^{-1} |S|^{-\nu/2}\, |W|^{-(\nu + q + 1)/2} \exp\left( -\mathrm{tr}(S^{-1}W^{-1})/2 \right).$$


13.4. Monte Carlo Integration 


In many modeling situations the posterior distribution of the parameters of interest is 
analytically intractable. In such cases numerical methods are needed to estimate either 
the full posterior distribution or some key moments of this distribution such as the 
posterior mean. 

In this section we consider computation of key posterior moments, without explic- 
itly obtaining the posterior distribution. The methods of Chapter 12 can be applied, 
with potentially less computational burden since the integral needs to be computed 
once for the entire sample rather than for every individual at every iteration. In the 
subsequent section we present methods to simulate the posterior distribution. 


13.4.1. Importance Sampling 


Suppose the problem is to evaluate the posterior moment function $E[m(\theta)|y]$, where
expectation is with respect to the posterior density $p(\theta|y)$. We wish to compute

$$E[m(\theta)] = \int_{R(\theta)} m(\theta)\, p(\theta|y)\, d\theta. \tag{13.42}$$

For example, the posterior mean of the $k$th parameter is $E[\theta_k] = \int \theta_k\, p(\theta|y)\, d\theta$. Other
examples include posterior standard deviations, marginal posterior densities, posterior
intervals, and posterior expectations of a given function of parameters.
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From Chapter 12 a direct Monte Carlo integral estimate of E[m(@)] is E [m(@)] = 
so! >>, m(6"), where 6°, s = 1,..., S, are S draws of @ from the posterior density 
p(0|y). However, this estimate is infeasible in the current Bayesian setting if there is 
no closed-form solution for the posterior density defined formally in (13.1), as then 
it is not possible to make draws from the posterior p(@|y). Instead, we use impor- 
tance sampling, introduced in Section 12.7.2. The integral considered in (13.42) can 
be rewritten as 


E [m(0)] = / 


R0) 


( m(8) p(Oly) 
g0) 
where g(0) > 0 is a known density function, with the same support as p(0|y), that is 

easy to make draws from. The corresponding Monte Carlo integral estimate is 


) g(0)d0, (13.43) 


s 1S m@*)p@ ly) 
E [m(0)] = ; ; 
S 32 g(@) 
where 6°, s =1,..., S, are S draws from of 8 from the importance sampling den- 


sity g(0) rather than from the original target density p(@|y). Note that the requirement 
that p(@|y) and g(0) should have the same support is potentially problematic if p(@|y) 
depends on additional parameters or if the functional form of the full conditional den- 
sities is known but that of the marginal posterior is not. 

Application to the posterior density additionally needs to account for the constant
of integration in the denominator of (13.1). Let $p^{ker}(\theta|y)$ denote the kernel of the
posterior density, where $p^{ker}(\theta|y) = L(y|\theta)\pi(\theta)$ or a multiple of this quantity. For
notational simplicity the dependence on $y$ is suppressed in what follows. The
posterior density is then

$$
p(\theta) = \frac{p^{ker}(\theta)}{\int p^{ker}(\theta)\, d\theta},
$$

with corresponding posterior moment

$$
E[m(\theta)] = \int m(\theta) \left( \frac{p^{ker}(\theta)}{\int p^{ker}(\theta)\, d\theta} \right) d\theta
= \frac{\int m(\theta)\, p^{ker}(\theta)\, d\theta}{\int p^{ker}(\theta)\, d\theta}
= \frac{\int \left( m(\theta)\, p^{ker}(\theta)/g(\theta) \right) g(\theta)\, d\theta}{\int \left( p^{ker}(\theta)/g(\theta) \right) g(\theta)\, d\theta}.
$$

The importance sampling-based estimate of the posterior moment $E[m(\theta)]$ is then

$$
\hat{E}[m(\theta)] = \frac{\frac{1}{S} \sum_{s=1}^{S} m(\theta^s)\, p^{ker}(\theta^s)/g(\theta^s)}{\frac{1}{S} \sum_{s=1}^{S} p^{ker}(\theta^s)/g(\theta^s)}, \tag{13.44}
$$

where $\theta^s$, $s = 1, \ldots, S$, are $S$ draws of $\theta$ from the importance sampling density $g(\theta)$.

This method was proposed by Kloek and van Dijk (1978). Geweke (1989) established
consistency and asymptotic normality under some regularity conditions. These
conditions include the assumptions that the importance sampling density $g(\theta) > 0$
over the support $R(\theta)$ of $p(\theta)$; that $E[m(\theta)] < \infty$, so the posterior moment exists; and
that $\int p(\theta|y)\, d\theta = 1$, so the posterior density is proper. As previously noted, usually
we work with the kernel $p^{ker}(\theta|y) = L(y|\theta)\pi(\theta)$, which need not integrate to one.
The prior $\pi(\theta)$ need not be proper, but to ensure that $\int p(\theta|y)\, d\theta = 1$ it is necessary
that $\int \pi(\theta)\, d\theta < \infty$.

The importance sampling approach is simple, but implementation entails subtleties
well explained in Geweke (1989). A critical requirement is that $g(\theta)$ should
have thicker tails than $p(\theta|y)$, to ensure that the importance weight $w(\theta) =
p(\theta|y)/g(\theta)$ remains bounded. In view of the asymptotic normality of the
posterior, a good choice of $g(\theta)$ is a multivariate t-distribution, with the mean set to the
posterior mode, the covariance matrix proportional to the inverse of the negative Hessian
of the log of the posterior, and degrees of freedom set to a value sufficiently small to
ensure thick tails. Geweke (1989) also provides a measure, called the relative numerical
efficiency, that estimates the number of replications required to achieve a given
level of precision of $\hat{E}[m(\theta)]$ computed using draws from $g(\theta)$ relative to the number
of replications needed if draws from $p(\theta|y)$ were possible. From Chapter 12, for a
higher dimensional integral more simulation draws are required to get a good
approximation to the integral and one might additionally use simulation acceleration methods
presented in Chapter 12, such as antithetic sampling.
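The self-normalized estimate (13.44) can be sketched in a few lines. The example below is illustrative only and assumes NumPy: the data are simulated, the posterior kernel combines a $N[\theta, 1]$ likelihood with a Cauchy prior (chosen so that the posterior is not available in closed form), and the importance density is a Student-t with 5 degrees of freedom and an inflated scale. The function names, seed, and tuning values are assumptions. Because $\theta$ is scalar, the answer can be checked against brute-force integration on a grid.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=1.0, scale=1.0, size=20)     # simulated data, y_i ~ N[theta, 1]

def log_kernel(theta):
    """log p^ker(theta|y): N[theta, 1] likelihood times a Cauchy prior, up to a constant."""
    theta = np.atleast_1d(theta)
    loglik = -0.5 * ((y[None, :] - theta[:, None]) ** 2).sum(axis=1)
    logprior = -np.log1p(theta ** 2)
    return loglik + logprior

def log_t_kernel(theta, loc, scale, df):
    """log kernel of a Student-t density; constants cancel in the ratio (13.44)."""
    z = (theta - loc) / scale
    return -0.5 * (df + 1) * np.log1p(z ** 2 / df)

S, df = 50_000, 5
loc, scale = y.mean(), 2.0 / np.sqrt(len(y))    # centred near the mode, thick tails
theta_s = loc + scale * rng.standard_t(df, size=S)          # draws from g

log_w = log_kernel(theta_s) - log_t_kernel(theta_s, loc, scale, df)
w = np.exp(log_w - log_w.max())                 # importance weights, rescaled for stability
post_mean_is = np.sum(w * theta_s) / np.sum(w)  # (13.44) with m(theta) = theta
ess = w.sum() ** 2 / (w ** 2).sum()             # a common effective-sample-size diagnostic

# Brute-force check on a fine grid (feasible only because theta is scalar)
grid = np.linspace(loc - 3.0, loc + 3.0, 20001)
lk = log_kernel(grid)
k = np.exp(lk - lk.max())
post_mean_grid = np.sum(grid * k) / np.sum(k)
print(post_mean_is, post_mean_grid, ess)
```

The effective sample size computed here is a generic diagnostic for weight degeneracy; it is related in spirit to, but not the same as, the relative numerical efficiency of Geweke (1989).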

The importance sampling method uses each draw $\theta^s$ from the sampling density
$g(\theta)$ with equal probability. A more efficient approximation would weight the draws
according to how close $g(\theta^s)$ is to the target $p(\theta^s|y)$. This can be done by importance
resampling (see Gelman et al., 1995).
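A minimal sketch of importance resampling follows, assuming NumPy. The target kernel (a Gamma(3, 1) density, for which the true mean and variance both equal 3), the exponential importance density, and the sample sizes are hypothetical choices made so the result is easy to check. Draws are resampled with probability proportional to their importance weights; resampling with replacement is used here for simplicity.

```python
import numpy as np

rng = np.random.default_rng(2)

def log_target_kernel(theta):
    """Kernel of a Gamma(3, 1) density, theta^2 exp(-theta), known up to a constant."""
    return 2.0 * np.log(theta) - theta

S, n_resample = 100_000, 10_000
theta_s = rng.exponential(scale=3.0, size=S)     # draws from g: exponential with mean 3
log_g = -theta_s / 3.0                           # log g up to a constant
log_w = log_target_kernel(theta_s) - log_g
w = np.exp(log_w - log_w.max())
prob = w / w.sum()                               # normalized importance weights

# Resample with probability proportional to the weights
idx = rng.choice(S, size=n_resample, replace=True, p=prob)
theta_post = theta_s[idx]                        # approximate draws from the target
print(theta_post.mean(), theta_post.var())
```

The resampled values can then be treated as an approximate sample from the target, for example to compute quantiles.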

The importance sampling method can be used to provide many useful summary 
measures of the posterior, as presented in Section 13.2.5. This includes estimates of 
the quantiles and percentiles of the posterior, permitting calculation of 95% posterior 
intervals and plots of the posterior density of $\theta_k$.


13.5. Markov Chain Monte Carlo Simulation 


A modern idea in Bayesian analysis is that rather than concentrating on the estimation 
of key summary measures of the posterior distribution (see the previous section) it is 
desirable to obtain a large sample from the posterior distribution. Then the summary 
statistics of this sample from the posterior will provide desired information about the 
moment characteristics of the sample of estimates and about other interesting associ- 
ated measures such as marginal distributions of parameters or functions of parameters. 
For example, given $S$ draws $\theta^s$, $s = 1, \ldots, S$, from the posterior distribution, $E[\theta_k]$ can be
estimated by $S^{-1} \sum_{s} \theta_k^s$.

The challenge is to make draws from the joint posterior distribution when there is no 
tractable closed-form expression for the posterior density. If a suitable density exists 
for computation of posterior moments using importance sampling, then it might also be 
suitable for making draws from the posterior using the accept-reject method presented 
in Section 12.8. However, this method can be very inefficient as a high percentage of 
draws may be rejected.

Instead, sequential draws are made yielding simulated values that, if the sequence
is run long enough, converge to a stationary distribution that coincides with the
target posterior density $p(\theta|y)$. The method is called Markov chain Monte Carlo
(MCMC), because it involves simulation (Monte Carlo) and the sequence is that of
a Markov chain. After convergence of the chain, $S$ sequential draws can be used to
compute summary measures for the posterior, such as estimating $E[\theta_k]$ by
$\hat{E}[\theta_k] = S^{-1} \sum_{s} \theta_k^s$. The draws are positively correlated, however, so the precision of the
estimate will be reduced for given $S$ because its estimated variance will exceed the usual
$(S - 1)^{-1} \sum_{s} (\theta_k^s - \hat{E}[\theta_k])^2$.

The sequential method entails constructing a Markov chain. Two widely used al- 
gorithms are the Gibbs sampler and the Metropolis—Hastings algorithm, the former 
being a special case of the latter, see Hastings (1970). Excellent detailed treatments of 
the subject can be found in Gelman et al. (1995), Gamerman (1997), and Robert and 
Casella (1999). What follows is a bare-bones sketch. 


13.5.1. Markov Chains 


Before presenting the Gibbs sampler and the Metropolis—Hastings algorithm we pro- 
vide some key definitions and concepts used in the MCMC literature. These definitions 
are given in the context of a model with discrete states. They can be extended to the 
continuous state model, relevant to applications where the posterior is continuous in 
the parameters. 

A Markov chain is defined as a sequence of random variables $x_n$ ($n = 0, 1, 2, \ldots$),
where $x_n$ takes values in a finite space $A$, together with a transition kernel $K(\cdot)$ that
defines the probability that $x_n$ equals a particular value given previous values $x_{n-j}$. We
consider a Markov chain with the property that

$$
\Pr[x_{n+1} = x \mid x_n, x_{n-1}, \ldots, x_0] = \Pr[x_{n+1} = x \mid x_n], \tag{13.45}
$$

so that the distribution of $x_{n+1}$ given the past is completely determined only by the
preceding value $x_n$. The transition kernel is a transition matrix $T$ with element

$$
t_{xy} = \Pr[x_{n+1} = y \mid x_n = x], \tag{13.46}
$$


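which informally is the probability of transition from $x$ to $y$. For a finite-state Markov
chain the set $A$ of values (or states) that $x_n$ may take is finite with, say, $m$ elements.
Then

$$
T = \begin{bmatrix} t_{11} & \cdots & t_{1m} \\ \vdots & \ddots & \vdots \\ t_{m1} & \cdots & t_{mm} \end{bmatrix}, \tag{13.47}
$$

with $\sum_{j=1}^{m} t_{ij} = 1$, $i = 1, \ldots, m$.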

Now consider the transition from $x$ to $y$ in $n$ steps (stages). The transition
probability is given by $T^n$, the $n$-times matrix product of $T$. The rows of the matrix $T^n$ give the
marginal distribution across the $m$ states at the $n$th stage, and the $j$th row vector
$t_j^{(n)} = (t_{j1}^{(n)}, \ldots, t_{jm}^{(n)})$ gives the marginal distribution of transition probabilities from state $j$ to
the other states at stage $n$. If the initial distribution of transition probabilities is denoted
$t^{(0)}$ then $t^{(n)} = t^{(0)} T^n = t^{(n-1)} T$. So the marginal distribution of transition probabilities
at the $n$th stage is determined solely by the initial distribution and the transition matrix.

In the Markov simulation context, the asymptotic behavior of the chain as $n \to \infty$
is of interest. The chain is said to yield a stationary distribution or invariant
distribution with transition probabilities $t_{xy}$ if

$$
\sum_{x \in A} t_x t_{xy} = t_y \quad \forall y \in A, \tag{13.48}
$$

where the transition is from state $x$ to state $y$. Then applying the transition matrix leads to
no change in the marginal distribution of transition probabilities. The existence and
uniqueness of a stationary distribution is an important issue.

If the stationary distribution exists, and if $\lim_{n \to \infty} t_{xy}^{(n)} = t_y$, then the chain will
asymptotically approach $t_y$ independently of the initial distribution. In this sense $t_y$ is
a limiting distribution. Although here the stationary distribution is defined for a
finite-state Markov chain, MCMC methods can handle Markov chains that are not finite
state; see Gilks, Richardson, and Spiegelhalter (1996, pp. 60-61).
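These definitions can be illustrated numerically for a small chain. The sketch below assumes NumPy and uses a hypothetical three-state transition matrix. It computes the $n$-step matrix $T^n$, the marginal distribution $t^{(n)} = t^{(0)} T^n$ from a degenerate starting distribution, and the stationary distribution as the normalized left eigenvector of $T$ associated with the unit eigenvalue.

```python
import numpy as np

# Hypothetical 3-state transition matrix; each row sums to one
T = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.1, 0.3, 0.6]])

t0 = np.array([1.0, 0.0, 0.0])          # start in state 1 with certainty
Tn = np.linalg.matrix_power(T, 200)     # n-step transition matrix T^n with n = 200
tn = t0 @ Tn                            # marginal distribution at stage n

# Stationary distribution: left eigenvector of T for eigenvalue 1, normalized to sum to one
evals, evecs = np.linalg.eig(T.T)
v = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
t_stat = v / v.sum()

print(np.round(tn, 4))
print(np.round(t_stat, 4))
```

For this matrix every row of $T^n$ approaches the same vector, so the limit does not depend on the starting state, which is the sense in which the stationary distribution is also the limiting distribution.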

A state y may be recurrent or transient. A recurrent state is one that will be revis- 
ited with probability one, and a transient state is one that will not be revisited with 
some positive probability. 

For Bayesian applications the goal is to obtain draws from the posterior $p(\theta)$.
Applying a Markov chain to obtain these draws, the initial value of a parameter vector,
$\theta^{(0)}$ (which is analogous to the distribution of states), is assigned or sampled from
the transition kernel. Using a suitable method of drawing pseudo-random numbers, a
new vector of values $\theta^{(1)}$ is drawn from the transition kernel evaluated at $\theta^{(0)}$, that is,
$K(\theta^{(0)})$. At the $n$th stage the draws are from a transition kernel $K(\theta^{(n-1)})$, and so forth.
The Markov chain used is one such that as $n \to \infty$ the limiting distribution is the
posterior $p(\theta)$. Once convergence to the limiting distribution occurs all subsequent draws
are also from this distribution, though they will be correlated.

These ideas provide the intuitive basis for a class of MCMC procedures that can be
used to recover Bayesian posterior distributions for many different, and possibly
high-dimensional, models such as, for example, the linear hierarchical models discussed in
Section 13.3.4. Provided that one specifies a transition kernel $K(\theta^{(n-1)}, \cdot)$ from which
draws of $\theta$ can be made and within which is embedded the chain's limiting distribution,
the target posterior distribution can be recovered in the sense of being approached
arbitrarily closely.

The current description is at a very general level. In practice, the choice of the tran- 
sition kernel is not unique and there are many possible chains one can construct. Some 
choices may be better than others in terms of speed of convergence to the limiting 
distribution. If convergence is found to be very slow and computationally expensive, 
alternative chains may need to be substituted. Clearly, criteria are needed to determine 
whether convergence has occurred and how close to the target distribution the chain is 
at the nth stage.


13.5.2. Gibbs Sampler


We begin with the Gibbs sampler, a member of the MCMC class that is easy to describe
and implement.

Let $\theta = [\theta_1'\ \theta_2']'$ have posterior density $p(\theta) = p(\theta_1, \theta_2)$, where for notational
simplicity we suppress dependence on $y$. If the conditional densities are known, which is
not guaranteed as knowledge of both $p(\theta_1|\theta_2)$ and $p(\theta_2|\theta_1)$ is necessary, then
alternating sequential draws from $p(\theta_1|\theta_2)$ and $p(\theta_2|\theta_1)$ in the limit converge to draws
from $p(\theta_1, \theta_2)$.


Example 


A simple illustration is to consider bivariate normal data with uniform prior for the
mean and known covariance matrix. Let $y = (y_1, y_2) \sim N[\theta, \Sigma]$, where $\theta = [\theta_1\ \theta_2]'$
and $\Sigma$ has diagonal entries 1 and off-diagonal entries $\rho$. Then given a uniform prior
for $\theta$ the posterior can be shown to be $\theta|y \sim N[\bar{y}, N^{-1}\Sigma]$, a bivariate normal. Since
the conditional posterior distributions are

$$
\theta_1 | \theta_2, y \sim N\left[ \bar{y}_1 + \rho(\theta_2 - \bar{y}_2),\ (1 - \rho^2)/N \right],
$$
$$
\theta_2 | \theta_1, y \sim N\left[ \bar{y}_2 + \rho(\theta_1 - \bar{y}_1),\ (1 - \rho^2)/N \right],
$$

we can iteratively sample from each conditional normal distribution using updated
values of $\theta_1$ and $\theta_2$. If the chain is run long enough then it will converge to the bivariate
normal. In this example it is easy to make direct draws from the joint posterior of $\theta|y$,
using Choleski's transformation given in Section 12.8, but in other examples it can be
possible to draw from the conditionals but not the joint posterior.
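The two conditional draws in this example translate directly into code. The following is a minimal sketch, assuming NumPy; the simulated data, the value $\rho = 0.8$, the seed, and the burn-in length are illustrative assumptions. Because the joint posterior $N[\bar{y}, N^{-1}\Sigma]$ is known here, the output of the sampler can be compared with it.

```python
import numpy as np

rng = np.random.default_rng(42)
rho, N = 0.8, 50
Sigma = np.array([[1.0, rho], [rho, 1.0]])
y = rng.multivariate_normal([1.0, -1.0], Sigma, size=N)    # simulated data
ybar = y.mean(axis=0)
cond_sd = np.sqrt((1.0 - rho ** 2) / N)     # standard deviation of each conditional

n_burn, n_keep = 1_000, 20_000
theta = np.zeros(2)                         # arbitrary starting value
draws = np.empty((n_keep, 2))
for it in range(n_burn + n_keep):
    # theta_1 | theta_2, y, using the current value of theta_2
    theta[0] = rng.normal(ybar[0] + rho * (theta[1] - ybar[1]), cond_sd)
    # theta_2 | theta_1, y, using the freshly updated theta_1
    theta[1] = rng.normal(ybar[1] + rho * (theta[0] - ybar[0]), cond_sd)
    if it >= n_burn:
        draws[it - n_burn] = theta

post_mean = draws.mean(axis=0)    # compare with ybar
post_cov = np.cov(draws.T)        # compare with Sigma / N
print(post_mean, ybar)
print(post_cov, Sigma / N)
```

A larger $\rho$ makes successive draws more strongly correlated, so more iterations are needed for the same precision.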


Gibbs Sampler 


More generally, consider a $q$-dimensional target distribution $p(\theta)$, where the notation
suppresses the dependence on data. Suppose that $\theta$ is partitioned into $d$ blocks. For
example, $\theta' = [\beta'\ \sigma^2]$ in a linear regression example. Let $\theta_k$ denote the $k$th block
and $\theta_{-k}$ denote all components of $\theta$ aside from $\theta_k$. Assume that the full conditional
distributions $p(\theta_k|\theta_{-k})$, $k = 1, \ldots, d$, are known. Then sequential sampling from the
full conditionals can be set up as follows:


1. Let the initial values of $\theta$ be $\theta^{(0)} = (\theta_1^{(0)}, \ldots, \theta_d^{(0)})$.


2. The next iteration involves sequentially revising all components of $\theta$ to yield
$\theta^{(1)} = (\theta_1^{(1)}, \ldots, \theta_d^{(1)})$, generated using $d$ draws from the $d$ conditional distributions as follows:

$$
p(\theta_1^{(1)} \mid \theta_2^{(0)}, \ldots, \theta_d^{(0)})
$$
$$
p(\theta_2^{(1)} \mid \theta_1^{(1)}, \theta_3^{(0)}, \ldots, \theta_d^{(0)})
$$
$$
\vdots
$$
$$
p(\theta_d^{(1)} \mid \theta_1^{(1)}, \theta_2^{(1)}, \ldots, \theta_{d-1}^{(1)})
$$


3. Return to step 1, reinitialize the vector $\theta$ at $\theta^{(1)}$, and cycle through step 2 again to obtain
the new draw $\theta^{(2)}$. Repeat the steps until convergence is achieved.


Gilks et al. (1996, p. 7) provide a sketch of the proof of the statement that the 
stationary distribution is the posterior. After convergence the draws are from the target 
joint posterior. Geman and Geman (1984) showed that the stochastic sequence $\{\theta^{(n)}\}$
is a Markov chain with the correct stationary distribution. Gelfand and Smith (1990) 
showed that, under some conditions, as the number of cycles of draws from the full 
set of conditionals tends to infinity, the chain converges to the stationary posterior 
distribution. See also Tanner and Wong (1987). Once convergence occurs, numerous 
draws can be made and used to calculate sample analogues of the posterior moments 
of marginal or joint distributions. 

The results mentioned here do not tell us how many cycles are needed for
convergence, which is model dependent. It is very important to ensure that a sufficient number
of cycles is executed for the chain to converge. A variety of diagnostic tests of
convergence are available. Because estimates of posterior moments should be based on
draws from the posterior distribution it is standard practice to discard the earlier results
from the chain, the so-called burn-in phase.

Sequential simulation algorithms can be modified so that each draw depends not 
simply on the immediately preceding draw but also on earlier draws, the key require- 
ment being that probability of improvement on the current approximation to the pos- 
terior should be positive and (preferably) high. The attraction of the more restrictive 
Markovian property is that it facilitates the proof that the transition distributions con- 
verge to the target posterior. 

For Bayesian analysis the Gibbs sampler is useful when the joint posterior is in- 
tractable but the full conditional distributions are available in a convenient form. Many 
applications use considerable ingenuity and knowledge of conjugate priors and related 
Bayesian results, many from the earlier presimulation literature, to specify priors that 
lead to known full conditional distributions. 

We consider two examples that apply the MCMC methods. 


Linear Regression Example 


In Section 13.3.2 we analyzed the posterior distribution of the normal linear
homoskedastic regression model, given normal-gamma conjugate priors. The conditional
posterior of $\beta$ given $\sigma^{-2}$ was shown to be multivariate normal, and the conditional
posterior of $\sigma^{-2}$ given $\beta$ is the gamma. Even though integration is feasible and we can
derive the posterior in an explicit form (see (13.32)) it is actually easier to use the
Gibbs sampler to draw a large sample from the joint posterior distribution. The chain
consists of recursive draws from the normal conditional on the precision parameter
$\sigma^{-2}$ and from the gamma distribution conditional on $\beta$.
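A two-block sampler of this kind can be sketched as follows, assuming NumPy. To keep the conditional distributions short, the sketch swaps in independent priors, $\beta \sim N[b_0, B_0^{-1}]$ and precision $h = \sigma^{-2} \sim$ Gamma with shape $a_0$ and rate $c_0$, rather than the dependent normal-gamma prior of Section 13.3.2; all prior values, the simulated data, and the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 200, 2
X = np.column_stack([np.ones(N), rng.standard_normal(N)])
beta_true, sigma2_true = np.array([1.0, 2.0]), 0.5
y = X @ beta_true + np.sqrt(sigma2_true) * rng.standard_normal(N)

# Assumed independent priors: beta ~ N[b0, B0^{-1}], h = 1/sigma^2 ~ Gamma(a0, rate c0)
b0, B0 = np.zeros(K), np.eye(K) / 100.0
a0, c0 = 2.0, 1.0

XtX, Xty = X.T @ X, X.T @ y
n_burn, n_keep = 500, 5_000
h = 1.0                                    # starting value for the precision
draws = np.empty((n_keep, K + 1))
for it in range(n_burn + n_keep):
    # beta | h, y ~ N[C (B0 b0 + h X'y), C] with C = (B0 + h X'X)^{-1}
    C = np.linalg.inv(B0 + h * XtX)
    beta = C @ (B0 @ b0 + h * Xty) + np.linalg.cholesky(C) @ rng.standard_normal(K)
    # h | beta, y ~ Gamma(a0 + N/2, rate = c0 + SSR/2)
    resid = y - X @ beta
    h = rng.gamma(shape=a0 + N / 2, scale=1.0 / (c0 + resid @ resid / 2))
    if it >= n_burn:
        draws[it - n_burn] = np.append(beta, 1.0 / h)

post_mean = draws.mean(axis=0)    # posterior means of (beta_1, beta_2, sigma^2)
print(post_mean)
```

Note that NumPy parameterizes the gamma distribution by shape and scale, so the rate is inverted before the draw.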

The structure of the algorithm resembles that given later in Section 13.6 for a 
slightly more complicated case of a two-equation seemingly unrelated regressions 
model.

In many cases it would be natural to work with blocks of parameters. For example,
in a multiequation multivariate linear regression model with a nondiagonal
contemporaneous covariance matrix, the conditional mean parameters $(\beta_1, \beta_2, \ldots)$ form one
block of parameters, and $\Sigma$ forms a second. Then the full conditional distributions
will have the form $\beta_1, \beta_2, \ldots | \text{data}, \Sigma$ and $\Sigma | \text{data}, \beta_1, \beta_2, \ldots$. Chib and Greenberg
(1996, pp. 418-419) outline the Gibbs algorithm for this case.


Hierarchical Prior Example 


The Gibbs sampler has been deployed with much success in the analysis of the hi- 
erarchical prior model. From the structure of the linear hierarchical model given in 
(13.39)-(13.41), it can be seen that formulating a Markov chain based on a full set of 
conditionals is feasible in this case. The same general approach can be extended to a 
nonlinear hierarchical prior model, although some additional steps are necessary if the 
nonlinearity occurs in conjunction with a latent variable model (Albert, 1988). 


13.5.3. Metropolis Algorithm 


The Gibbs sampler is the best-known MCMC algorithm. Its applicability is limited, 
however, as it requires direct sampling from the full conditional distributions, which 
may not be known. Two extensions that allow the MCMC to be applied more gener- 
ally are the Metropolis algorithm and the Metropolis—Hastings algorithm. Chib and 
Greenberg (1995) provide a tutorial and references. The following summary is sim- 
pler but avoids many details that are necessary if the reader seeks a more complete 
understanding. 

The Metropolis algorithm constructs a sequence $\{\theta^{(n)}, n = 1, 2, \ldots\}$ whose
distributions converge to the target posterior, assumed to be computable up to a normalizing
constant.

For notational simplicity we again suppress dependence of $p(\theta|y)$ on $y$. The
algorithm consists of the following steps:


1. Draw a starting point $\theta^{(0)}$ from an initial approximation to the posterior for which
$p(\theta^{(0)}) > 0$. For example, the draw may be from a multivariate t-distribution centered
on the mode of the marginal posterior distribution.

2. Next set $n = 1$. Draw $\theta^*$ from a symmetric jumping distribution $J_n(\theta^*|\theta^{(n-1)})$, with
the property that for any arbitrary pair $(\theta^a, \theta^b)$, $J_n(\theta^a|\theta^b) = J_n(\theta^b|\theta^a)$. An example is
$\theta^*|\theta^{(n-1)} \sim N[\theta^{(n-1)}, V]$ for some fixed $V$. Symmetry of the jumping distribution leads to
simplicity but is not otherwise essential.

3. Calculate the ratio of densities $r = p(\theta^*)/p(\theta^{(n-1)})$.

4. Set

$$
\theta^{(n)} = \begin{cases} \theta^* & \text{with probability } \min(r, 1), \\ \theta^{(n-1)} & \text{with probability } 1 - \min(r, 1), \end{cases}
$$

which means that the draw $\theta^{(n)}$ is a draw from a mixture distribution with components
$\theta^*$ and $\theta^{(n-1)}$.

5. Return to step 2, increase the counter $n$ by one, and repeat the subsequent steps.


6. After a suitably large number of iterations apply the necessary checks for the
convergence of the distribution. If convergence has occurred the target posterior has been
recovered.
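The steps above can be sketched as a short random-walk sampler, assuming NumPy. Working with the log of the posterior kernel avoids numerical underflow and makes clear that the normalizing constant is never needed. The function name `metropolis`, the bivariate normal target kernel, the jump covariance $V = I$, and the run lengths are illustrative assumptions.

```python
import numpy as np

def metropolis(log_kernel, theta0, V, n_iter, rng):
    """Random-walk Metropolis with symmetric N[theta, V] jumps.

    Returns the draws (n_iter x dim) and the fraction of accepted candidates."""
    theta = np.asarray(theta0, dtype=float)
    L = np.linalg.cholesky(V)
    logp = log_kernel(theta)
    draws = np.empty((n_iter, theta.size))
    n_accept = 0
    for n in range(n_iter):
        prop = theta + L @ rng.standard_normal(theta.size)   # step 2: candidate theta*
        logp_prop = log_kernel(prop)
        # steps 3-4: accept with probability min(r, 1), r = p(theta*) / p(theta^(n-1))
        if np.log(rng.uniform()) < logp_prop - logp:
            theta, logp = prop, logp_prop
            n_accept += 1
        draws[n] = theta                                     # repeats the old value on rejection
    return draws, n_accept / n_iter

# Illustrative target: bivariate normal kernel with mean mu and covariance Sigma
mu = np.array([1.0, -1.0])
Sigma_inv = np.linalg.inv(np.array([[1.0, 0.6], [0.6, 1.0]]))

def log_kernel(th):
    return -0.5 * (th - mu) @ Sigma_inv @ (th - mu)

rng = np.random.default_rng(4)
draws, acc_rate = metropolis(log_kernel, np.zeros(2), np.eye(2), 32_000, rng)
kept = draws[2_000:]               # discard a burn-in phase
print(kept.mean(axis=0), acc_rate)
```

A rejected candidate still produces a draw, namely a repeat of the current value, which is what makes the output a draw from the mixture described in step 4.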


This algorithm can be viewed as an iterative method to maximize $p(\theta)$. If $\theta^*$
increases $p(\theta)$ then $\theta^{(n)} = \theta^*$ always, whereas if $\theta^*$ decreases $p(\theta)$ then $\theta^{(n)} = \theta^*$ with
probability $r < 1$.

The algorithm is similar in spirit to accept-reject sampling (see Section 12.8),
though there is no requirement here that a fixed multiple of the jumping distribution
always covers the posterior.

The Metropolis algorithm generates a Markov chain that has properties of
reversibility, irreducibility, and Harris recurrence that ensure convergence to a stationary
distribution. Gelman et al. (1995) demonstrate that this stationary distribution is the
desired posterior $p(\theta)$ as follows. Suppose $\theta^{(n-1)}$ is a draw from $p(\theta)$, and let $\theta_a$ and
$\theta_b$ be two points such that $p(\theta_b) \geq p(\theta_a)$. If $\theta^{(n-1)} = \theta_a$ and $\theta^* = \theta_b$ then
$\theta^{(n)} = \theta_b$ with certainty and
$\Pr[\theta^{(n)} = \theta_b, \theta^{(n-1)} = \theta_a] = J_n(\theta_b|\theta_a)\, p(\theta_a)$. If the order is reversed and
$\theta^{(n-1)} = \theta_b$ and $\theta^* = \theta_a$, then $\theta^{(n)} = \theta_a$ with probability $r = p(\theta_a)/p(\theta_b)$ and
$\Pr[\theta^{(n)} = \theta_a, \theta^{(n-1)} = \theta_b] = J_n(\theta_a|\theta_b)\, p(\theta_b)\, [p(\theta_a)/p(\theta_b)] = J_n(\theta_a|\theta_b)\, p(\theta_a) = J_n(\theta_b|\theta_a)\, p(\theta_a)$,
given the assumption of symmetric jumping distribution. The marginal distributions of $\theta^{(n)}$ and
$\theta^{(n-1)}$ are therefore equal, since their joint distribution is symmetric, so $p(\theta)$ is the
stationary distribution of the Markov chain.


13.5.4. The Metropolis—Hastings Algorithm 


The performance of the Metropolis algorithm varies with the choice of initial approxi- 
mating distribution and choice of jumping distribution. A potential problem is that the 
Metropolis algorithm may be slow, as would be the case if the move from the current 
to a new value is not made sufficiently often, causing the chain to move infrequently. 
The algorithm can be sped up by permitting use of jumping distributions that are
not symmetric.

The Metropolis-Hastings (M-H) algorithm is the same as the Metropolis algorithm,
except that in step 2 the jumping distribution need not be symmetric, and in
step 3 the ratio $r$ for general $n$ becomes

$$
r_n = \frac{p(\theta^*)/J_n(\theta^*|\theta^{(n-1)})}{p(\theta^{(n-1)})/J_n(\theta^{(n-1)}|\theta^*)}
= \frac{p(\theta^*)\, J_n(\theta^{(n-1)}|\theta^*)}{p(\theta^{(n-1)})\, J_n(\theta^*|\theta^{(n-1)})}.
$$

The remaining steps are executed with this revised definition. Note that if any
normalizing constants are present in either $p(\cdot)$ or $J_n(\cdot)$, then they cancel in this definition
of $r_n$. So both posterior and jumping probabilities need only be computed up to this
constant. See Hastings (1970).


13.5.5. M-H Examples


Different jumping distributions lead to different M-H algorithms with different
efficiency in terms of the number of draws needed to obtain the desired draws from
the posterior. We give several examples, noting that there are few general guidelines
available for choice of jumping distribution, except to use the Gibbs sampler wherever
possible.

The Gibbs sampler is a special case of the M-H algorithm. If $\theta$ is partitioned into $d$
blocks, then there are $d$ Metropolis steps at the $n$th step of the algorithm. The jumping
distribution is the conditional distribution given in Section 13.5.2 and it can be shown
that the acceptance probability is always 1. Gibbs sampling is also called alternating
conditional sampling.

It is possible to use mixed strategies, whereby different transition kernels are used 
for different subsets of parameters. For example, an M-H step can be combined with 
a Gibbs sampler, the latter being used for components for which direct sampling is 
feasible. 

The independence chain makes all draws from a fixed density $g(\theta)$, say, in which
case the acceptance probability simplifies to the ratio $r_n = w(\theta^*)/w(\theta^{(n-1)})$ of
importance weights $w(\theta) = p(\theta)/g(\theta)$. A random walk chain sets the draw $\theta^* = \theta^{(n-1)} +
\varepsilon$, where $\varepsilon$ is a draw from $g(\varepsilon)$.
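The role of the jumping-density terms in the M-H ratio is easiest to see with a jump that is not symmetric. The sketch below, assuming NumPy, uses a hypothetical multiplicative random walk, $\log \theta^* \sim N[\log \theta^{(n-1)}, s^2]$, which keeps a positive parameter positive. For this log-normal jump the ratio $J(\theta^{(n-1)}|\theta^*)/J(\theta^*|\theta^{(n-1)})$ equals $\theta^*/\theta^{(n-1)}$. The target kernel (a Gamma(3, 1) density, with mean and variance both equal to 3), the step size, and the run lengths are illustrative assumptions.

```python
import numpy as np

def log_kernel(theta):
    """Kernel of a Gamma(3, 1) target on theta > 0: theta^2 exp(-theta)."""
    return 2.0 * np.log(theta) - theta

rng = np.random.default_rng(5)
n_burn, n_keep, s = 2_000, 50_000, 0.8
theta = 1.0
draws = np.empty(n_keep)
for n in range(n_burn + n_keep):
    # Nonsymmetric jump: log(theta*) ~ N[log(theta), s^2], so theta* > 0 always
    prop = theta * np.exp(s * rng.standard_normal())
    # log r_n = log p(theta*) - log p(theta) + log J(theta|theta*) - log J(theta*|theta),
    # where the last two terms reduce to log(theta*) - log(theta) for this jump
    log_r = log_kernel(prop) - log_kernel(theta) + np.log(prop) - np.log(theta)
    if np.log(rng.uniform()) < log_r:
        theta = prop
    if n >= n_burn:
        draws[n - n_burn] = theta

post_mean = draws.mean()
print(post_mean, draws.var())
```

Dropping the correction term $\log \theta^* - \log \theta^{(n-1)}$ would turn this into a Metropolis step applied to a jump that is not symmetric, and the chain would then converge to the wrong distribution.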

Gelman et al. (1995, p. 334) consider simulating the $q$-variate normal with variance
$\Sigma$. For a Metropolis algorithm with jumping distribution $\theta^*|\theta^{(n-1)} \sim N[\theta^{(n-1)}, c^2\Sigma]$,
the choice $c \approx 2.4/\sqrt{q}$ leads to greatest efficiency relative to direct draws from the
$q$-variate normal. The efficiency is about 0.3, compared to $1/q$ for the Gibbs sampler
in the case that $\Sigma = \sigma^2 I_q$.


13.6. MCMC Example: Gibbs Sampler for SUR 


We illustrate the application of the Gibbs sampler to the analysis of the seemingly 
unrelated regression model. This example is slightly more challenging than an ap- 
plication to single-equation regression, because errors correlated across equations are 
introduced. 

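We consider a two-equation example with $i$th observation

$$
y_{1i} = x_{1i}'\beta_1 + \varepsilon_{1i},
$$
$$
y_{2i} = x_{2i}'\beta_2 + \varepsilon_{2i},
$$

where $(\varepsilon_{1i}, \varepsilon_{2i})$ are bivariate normal with zero mean and covariance matrix

$$
\Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{bmatrix}.
$$

Combining the two equations gives the $i$th observation

$$
y_i = X_i\beta + \varepsilon_i,
$$

where $y_i = [y_{1i}\ y_{2i}]'$, $X_i$ is the block-diagonal matrix with diagonal blocks $x_{1i}'$ and $x_{2i}'$,
$\beta = [\beta_1'\ \beta_2']'$, and $\varepsilon_i \sim N[0, \Sigma]$. In summary, the dgp is

$$
y_i | X_i, \beta, \Sigma \sim N[X_i\beta, \Sigma]
$$

and interest lies in estimating the posterior means of the regression parameters $\beta$ and
variance parameters $\Sigma$, given data $y$, $X$.

We consider independent informative priors, with

$$
\beta \sim N[\beta_0, B_0^{-1}],
$$
$$
\Sigma^{-1} \sim \text{Wishart}[\nu_0, D_0],
$$

where $B_0$ is defined as precision, the inverse of the prior variance, and the inverse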
Wishart, defined in Section 13.3.5, is a generalization of the inverse gamma. An al- 
ternative approach, not taken here, uses dependent priors similar to those in Section 
13.3.2, in which case 3|X ~N [Bo, oD] for specified wo. 

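Performing some algebra yields the conditional posteriors

$$
\beta | \Sigma, y, X \sim N\left[ C\left( B_0\beta_0 + \sum_{i=1}^{N} X_i'\Sigma^{-1}y_i \right),\ C \right],
$$
$$
\Sigma^{-1} | \beta, y, X \sim \text{Wishart}\left[ \nu_0 + N,\ \left( D_0^{-1} + \sum_{i=1}^{N} u_i u_i' \right)^{-1} \right],
$$

where $C = (B_0 + \sum_{i=1}^{N} X_i'\Sigma^{-1}X_i)^{-1}$ and $u_i = y_i - X_i\beta$ is the residual vector for
observation $i$. The Gibbs sampler can be
used since the conditional posteriors are known and sampling from both distributions
is straightforward.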
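The two conditional draws can be sketched as follows, assuming NumPy. This is not the authors' code: the sample size is reduced to $N = 500$ and the number of replications to 2,000 so that it runs quickly, the Wishart draw is built as a sum of $\nu_0 + N$ outer products of normal vectors (valid because the degrees of freedom are an integer), and all function and variable names are assumptions. The design and the prior settings $\beta_0 = 0$, $B_0^{-1} = \tau I$, $D_0 = I$, $\nu_0 = 5$ follow the simulation described in this section.

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulated data: intercept plus one standard normal regressor in each equation
N, K = 500, 4
Sigma_true = np.array([[1.0, -0.5], [-0.5, 1.0]])
X = np.zeros((N, 2, K))                 # X[i] is the 2 x 4 block-diagonal matrix X_i
X[:, 0, 0] = 1.0
X[:, 0, 1] = rng.standard_normal(N)
X[:, 1, 2] = 1.0
X[:, 1, 3] = rng.standard_normal(N)
eps = rng.standard_normal((N, 2)) @ np.linalg.cholesky(Sigma_true).T
y = X @ np.ones(K) + eps                # N x 2; all four coefficients equal 1

# Priors: beta ~ N[beta0, B0^{-1}] with B0^{-1} = tau I; Sigma^{-1} ~ Wishart[nu0, D0]
tau, nu0 = 10.0, 5
beta0, B0, D0_inv = np.zeros(K), np.eye(K) / tau, np.eye(2)

def draw_beta(Sigma_inv):
    XtSi = np.einsum("nij,ik->njk", X, Sigma_inv)              # X_i' Sigma^{-1}
    C = np.linalg.inv(B0 + np.einsum("njk,nkl->jl", XtSi, X))
    mean = C @ (B0 @ beta0 + np.einsum("njk,nk->j", XtSi, y))
    return mean + np.linalg.cholesky(C) @ rng.standard_normal(K)

def draw_Sigma_inv(beta):
    u = y - X @ beta                                           # N x 2 residuals u_i
    S_post = np.linalg.inv(D0_inv + u.T @ u)
    z = rng.standard_normal((nu0 + N, 2)) @ np.linalg.cholesky(S_post).T
    return z.T @ z              # Wishart draw as a sum of nu0 + N outer products

n_burn, n_keep = 200, 2_000
Sigma_inv = np.eye(2)
beta_draws = np.empty((n_keep, K))
Sigma_draws = np.empty((n_keep, 2, 2))
for it in range(n_burn + n_keep):
    beta = draw_beta(Sigma_inv)
    Sigma_inv = draw_Sigma_inv(beta)
    if it >= n_burn:
        beta_draws[it - n_burn] = beta
        Sigma_draws[it - n_burn] = np.linalg.inv(Sigma_inv)

beta_mean = beta_draws.mean(axis=0)
Sigma_mean = Sigma_draws.mean(axis=0)
print(beta_mean, beta_draws.std(axis=0))
print(Sigma_mean)
```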

For a simulation example we let the regressors in each equation be an intercept
plus a single scalar regressor, different in the two equations, generated from a
standard normal. Then $y_1$ and $y_2$ are generated with the four regression parameters
$\beta_{11} = \beta_{12} = \beta_{21} = \beta_{22} = 1$, the error variances $\sigma_{11} = \sigma_{22} = 1$, and the error
covariance $\sigma_{12} = \sigma_{21} = -0.5$. The sample size is either $N = 1,000$ or $N = 10,000$. Given
these data, we present Bayesian estimates of the parameters, where the prior
distributions set $\beta_0 = 0$, $B_0^{-1} = \tau I$, $D_0 = I$, and $\nu_0 = 5$. To check the impact of different
priors three values of $\tau$ are considered, $\tau = 10$, $\tau = 1$, and $\tau = 1/10$, with smaller
values of $\tau$ corresponding to tighter priors.

The Gibbs sampler makes draws recursively from the conditional posterior distri- 
butions. We reject the first 5,000 replications that constitute the “burn-in” phase and 
report results using the subsequent 50,000 and 100,000 replications. 

A selection of the results is given in Table 13.3, which reports the mean and standard
deviation of the marginal posterior distribution of each coefficient in five different samples that
themselves are independent draws. The first three columns present a sensitivity
analysis for different values of $\tau$, which shows that the results are not very sensitive. The
fourth column, compared to the first, shows that doubling the number of replications
has very little effect. The fifth column, compared to the first, shows that increasing the
sample size tenfold to 10,000 greatly increases the precision as expected, reducing
the standard deviation of the coefficient by a factor of more than 3, but with relatively
small impact on the point estimates.

Table 13.3. Gibbs Sampling: Seemingly Unrelated Regressions Example^a

Prior parameter $\tau$                  $\tau$=10    $\tau$=1     $\tau$=1/10  $\tau$=10    $\tau$=10
Sample size N                           1,000      1,000      1,000      1,000      10,000
Gibbs sample replications               50,000     50,000     50,000     100,000    100,000
$\beta_{11}$ (eq. 1 intercept)          0.971      1.013      0.983      1.020      1.010
                                        (0.0310)   (0.0312)   (0.0316)   (0.0324)   (0.0100)
$\beta_{12}$ (eq. 1 slope)              1.026      0.9835     1.006      1.006      1.015
                                        (0.0265)   (0.0271)   (0.0265)   (0.0268)   (0.0086)
$\beta_{21}$ (eq. 2 intercept)          1.016      0.972      0.993      1.017      0.991
                                        (0.0309)   (0.0325)   (0.0322)   (0.0326)   (0.0100)
$\beta_{22}$ (eq. 2 slope)              0.983      0.992      0.979      1.005      1.007
                                        (0.0256)   (0.0285)   (0.0272)   (0.0277)   (0.0085)
$\sigma_{11}$ (eq. 1 error variance)    0.960      0.969      1.012      1.043      1.010
                                        (0.0429)   (0.0434)   (0.0453)   (0.0466)   (0.0143)
$\sigma_{12}$ (error covariance)        -0.499     -0.507     -0.519     -0.576     -0.515
                                        (0.0340)   (0.0358)   (0.0368)   (0.0379)   (0.0113)
$\sigma_{22}$ (eq. 2 error variance)    0.950      1.066      1.049      1.062      1.002
                                        (0.425)    (0.0476)   (0.0467)   (0.0472)   (0.0141)

^a Model is a two-equation seemingly unrelated regression. Table gives the mean and standard deviation of the
posterior distribution for each parameter. Smaller values of $\tau$ correspond to tighter priors.


One way to check for convergence is to look at the means and standard deviations 
of the output and see whether they drift or stay at the same level. If the change is 
small, say less than 0.1 for 10,000 replications, then convergence is presumed. One 
also might look at several chains at a time. The draws will always be correlated but the 
important question is how fast the autocorrelation function decays to zero. Sometimes
slow decay cannot be fixed because it is simply inherent to the algorithm. One can also
take every tenth or hundredth observation to purge serial correlation. 

To check whether the Gibbs sampler has converged to the stationary posterior dis- 
tribution in the present case, we compute the first 20 autocorrelation coefficients of 
draws from the posterior after convergence for each coefficient. Lack of convergence 
would be indicated by the presence of serial correlation in the draws from the target 
distribution. When the number of replications is small, say 1,000, the autocorrelation 
coefficients are found to be as high as 0.06 in some cases. However, when the number 
of replications is 50,000 and greater, there is virtually no evidence of serial correlation 
up to order 20, and the correlation dies out as the order increases. In most cases the estimates
are smaller than 0.005. It is easy to verify that for $N = 1,000$, the prior parameter $\tau$
has very little impact on the posterior. This computation is very simple and takes little
more than a few seconds. 
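The autocorrelation checks and the thinning device described above can be sketched as follows, assuming NumPy. Since no sampler output is reproduced here, an AR(1) series with coefficient 0.9 stands in for a slowly mixing chain; the function name `autocorr`, the series length, and the thinning interval are illustrative assumptions.

```python
import numpy as np

def autocorr(x, max_lag):
    """Sample autocorrelations of a chain at lags 1..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = x @ x
    return np.array([x[:-k] @ x[k:] / denom for k in range(1, max_lag + 1)])

# Stand-in for MCMC output: an AR(1) series with coefficient 0.9 (illustrative only)
rng = np.random.default_rng(7)
n, phi = 100_000, 0.9
e = rng.standard_normal(n)
chain = np.empty(n)
chain[0] = e[0]
for t in range(1, n):
    chain[t] = phi * chain[t - 1] + e[t]

acf_full = autocorr(chain, 20)          # decays slowly, roughly like 0.9^k
acf_thin = autocorr(chain[::10], 20)    # after keeping every tenth draw
# Crude inefficiency factor: inflation of the variance of the sample mean due to
# serial correlation, truncated at 20 lags
ineff = 1.0 + 2.0 * acf_full.sum()
print(np.round(acf_full[:5], 3), np.round(acf_thin[:5], 3), round(ineff, 1))
```

Thinning lowers the serial correlation of the retained draws but also discards information, so it mainly saves storage rather than improving precision for a given run length.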


13.7. Data Augmentation 


The Gibbs sampler can sometimes be applied to a wider range of models by
introduction of auxiliary variables. In particular, this is the case for models involving latent
variables, such as discrete choice models, truncated and censored models, and finite
mixture models introduced in later chapters.

In the scalar case the latent dependent variable $y^*$ is not observed; instead, we
observe only $y = g(y^*)$ for some specified function $g(\cdot)$. For example, in a logit or probit
model (see Chapter 14) we may observe only whether $y^*$ is positive or negative, in
which case $y = 1(y^* > 0)$ and we observe $y = 1$ if $y^* > 0$ and $y = 0$ if $y^* \leq 0$.

Bayesian analysis of latent variable models, and especially the application of the 
Gibbs sampler, is greatly aided by the replacement of the latent variable by imputed 
values. This step is feasible if we can write down the predictive density of the latent 
variables in terms of the observed variables. The procedure of adding imputed values 
as if they were observed data is called data augmentation. (An example was given 
in Section 10.3.7 where the EM algorithm was exposited.) The essential insight, due 
to Tanner and Wong (1987), is that the posterior based only on the observed data is 
intractable, but that obtained after data augmentation is often tractable using the Gibbs 
sampler. 

Consider the posterior expressed in terms of both directly observed variables y and 
the latent variables y*, 


ply) = f pOly, YF FIÐAF, (13.49) 
: 


where the right-hand-side integral may be interpreted as an averaging operation with 
respect to y*. 

Analogous to the EM method, data augmentation involves cycling between an im- 
putation step, I-step, and a posterior step, P-step. 

In the imputation step we make draws from the full conditional density of y*. This
averages over the parameters θ that appear in the probability distribution that links y*
and y. The predictive distribution is


$$f(y^*|y) = \int f(y^*|y, \theta)\, p(\theta|y)\, d\theta. \tag{13.50}$$


Given the current draw from p(θ|y) we can make a draw of y* from f(y*|y), repeating
both parts of the step m times to generate m multiple imputations y*_i, i = 1,...,m.
This completes the I-step.

Given the augmented data from the I-step, the P-step is implemented by updating
the current approximation to p(θ|y); thus,


$$\text{updated } p(\theta|y) = \frac{1}{m}\sum_{i=1}^{m} p(\theta|y, y_i^*). \tag{13.51}$$


Then the algorithm returns to the I-step. 
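To make the I-step/P-step cycle concrete, the following is a minimal sketch of data augmentation with m = 1 for a probit model, where y = 1(y* > 0) and y* = x'β + ε. The specific choices here are assumptions for illustration and are not taken from the text: unit error variance, a flat prior on β (so that β given y* is normal with mean (X'X)^{-1}X'y* and variance (X'X)^{-1}), simulated data, and the helper names `draw_latent` and `probit_gibbs`.

```python
import numpy as np
from statistics import NormalDist

_STD = NormalDist()                      # standard normal distribution
_cdf = np.vectorize(_STD.cdf)            # elementwise Phi(z)
_inv_cdf = np.vectorize(_STD.inv_cdf)    # elementwise Phi^{-1}(q)


def draw_latent(mu, y, rng):
    """I-step: draw y* ~ N(mu, 1), truncated to y* > 0 if y = 1 and y* <= 0 if y = 0."""
    p0 = _cdf(-mu)                               # Pr[y* <= 0 | mu]
    u = rng.uniform(size=mu.shape)
    # Inverse-cdf method: restrict the uniform draw to the region implied by y.
    q = np.where(y == 1, p0 + u * (1.0 - p0), u * p0)
    q = np.clip(q, 1e-12, 1.0 - 1e-12)           # keep inv_cdf away from 0 and 1
    return mu + _inv_cdf(q)


def probit_gibbs(y, X, n_draws=500, burn_in=100, seed=0):
    """Alternate the I-step and P-step; return the retained draws of beta."""
    rng = np.random.default_rng(seed)
    k = X.shape[1]
    xtx_inv = np.linalg.inv(X.T @ X)
    chol = np.linalg.cholesky(xtx_inv)           # chol @ chol.T = (X'X)^{-1}
    beta = np.zeros(k)
    draws = np.empty((n_draws, k))
    for s in range(n_draws + burn_in):
        ystar = draw_latent(X @ beta, y, rng)                 # I-step
        mean = xtx_inv @ (X.T @ ystar)                        # P-step (flat prior)
        beta = mean + chol @ rng.standard_normal(k)
        if s >= burn_in:
            draws[s - burn_in] = beta
    return draws


# Illustrative run on simulated data (all numbers are made up).
rng = np.random.default_rng(42)
N = 400
X = np.column_stack([np.ones(N), rng.standard_normal(N)])
beta_true = np.array([0.5, -1.0])
y = (X @ beta_true + rng.standard_normal(N) > 0).astype(int)
draws = probit_gibbs(y, X)
print("posterior mean of beta:", draws.mean(axis=0))
```

The posterior that is awkward to handle given y alone becomes a standard normal linear regression posterior once the imputed y* is treated as data, which is the simplification the text describes.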

If m = 1, the procedure amounts to performing integration in (13.49) by Gibbs
sampling. If m is chosen to be sufficiently large, the posterior distribution is
approximated better. An extended example of the data augmentation method applied to the
missing data problem is given in Chapter 26.


13.8. Bayesian Model Selection


Chapters 7 and 8 dealt with issues of hypothesis testing, specification diagnostics, and 
model comparison from a frequentist viewpoint. In this section we consider the
principal tool, the Bayes factor, that is used in Bayesian analysis to evaluate the strength
of evidence in favor of the null hypothesis (model). It also serves as a criterion for
model selection, irrespective of whether nested or nonnested pairs of models are un- 
der consideration. In the econometrics literature, Zellner (1971, 1978) provided an 
early discussion in the context of model selection. Our treatment is based on Kass and 
Raftery’s (1995) review article. 

Denote the data by y and the two hypotheses under consideration, possibly
nonnested, by H1 and H2. Prior probabilities of the two hypotheses are Pr[H1] and
Pr[H2] = 1 − Pr[H1]. The corresponding dgps are Pr[y|H1] and Pr[y|H2]. The prior
probabilities of the models are transformed to posterior probabilities by the sample
evidence as reflected in the likelihood. By Bayes' Theorem


$$\Pr[H_1|y] = \frac{\Pr[y|H_1]\Pr[H_1]}{\Pr[y|H_1]\Pr[H_1] + \Pr[y|H_2]\Pr[H_2]}, \tag{13.52}$$


and the posterior odds ratio 


$$\frac{\Pr[H_1|y]}{\Pr[H_2|y]} = \frac{\Pr[y|H_1]\Pr[H_1]}{\Pr[y|H_2]\Pr[H_2]} \equiv B_{12}\,\frac{\Pr[H_1]}{\Pr[H_2]}, \tag{13.53}$$


where B12 = Pr[y|H1]/Pr[y|H2] is called the Bayes factor. Hypothesis 1 is preferred
if the posterior odds ratio exceeds one. The right-hand side of (13.53) expresses the
posterior odds ratio as the product of the Bayes factor and the prior odds. If a priori the
two models are equally probable, so Pr[H1] = Pr[H2], then the Bayes factor equals
the posterior odds in favor of H1. If several hypotheses are involved the Bayes factor
can be computed for all pairs of hypotheses. The Bayes factor is defined even if the 
hypotheses are not nested. 
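The relation between the Bayes factor, the prior odds, and the posterior probability can be sketched in a few lines. The function name `posterior_prob_h1` is hypothetical, and the sketch assumes that H1 and H2 are the only two hypotheses, so that their prior probabilities sum to one.

```python
def posterior_prob_h1(bayes_factor_12, prior_h1=0.5):
    """Posterior probability of H1 from the Bayes factor B12 and the prior Pr[H1].

    Assumes exactly two hypotheses, so Pr[H2] = 1 - Pr[H1].
    """
    prior_odds = prior_h1 / (1.0 - prior_h1)
    posterior_odds = bayes_factor_12 * prior_odds   # posterior odds = B12 x prior odds
    return posterior_odds / (1.0 + posterior_odds)


# With equal prior probabilities the posterior odds equal the Bayes factor.
print(posterior_prob_h1(3.0))
# A prior that favors H2 by 3 to 1 exactly offsets a Bayes factor of 3.
print(posterior_prob_h1(3.0, prior_h1=0.25))
```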

The Bayes factor has the form of a likelihood ratio. It depends on unknown parameters,
denoted by vectors θ1 and θ2, that are eliminated by averaging or integrating over
the parameter space with respect to the prior, so


$$\Pr[y|H_k] = \int \Pr[y|\theta_k, H_k]\, \pi(\theta_k|H_k)\, d\theta_k, \quad k = 1, 2. \tag{13.54}$$


From Section 13.2.5, this equation gives the marginal and the predictive probability of 
the data given the prior distribution. 
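As an illustration of averaging the likelihood over the prior, the sketch below uses a Bernoulli model with a Beta(a, b) prior, for which the integral has a closed form that can be compared with a simple Monte Carlo average over prior draws. The model, the data (s = 7 successes in n = 10 trials), and the function names are assumptions chosen for illustration. A point hypothesis H2: θ = 0.5 has no free parameter, so its marginal likelihood is just the likelihood.

```python
import numpy as np
from math import lgamma, exp


def log_beta_fn(a, b):
    """Log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def marginal_likelihood_mc(s, n, a=1.0, b=1.0, n_draws=100_000, seed=0):
    """Average the likelihood of one ordered Bernoulli sequence over prior draws."""
    rng = np.random.default_rng(seed)
    theta = rng.beta(a, b, size=n_draws)             # draws from the prior
    likelihood = theta**s * (1.0 - theta)**(n - s)   # Pr[y | theta]
    return likelihood.mean()


def marginal_likelihood_exact(s, n, a=1.0, b=1.0):
    """Closed form: B(a + s, b + n - s) / B(a, b)."""
    return exp(log_beta_fn(a + s, b + n - s) - log_beta_fn(a, b))


s, n = 7, 10
m1_mc = marginal_likelihood_mc(s, n)      # H1: theta ~ Uniform(0, 1)
m1 = marginal_likelihood_exact(s, n)
m2 = 0.5**n                               # H2: theta fixed at 0.5
bayes_factor_12 = m1 / m2
print(m1_mc, m1, bayes_factor_12)
```

Note that the normalizing constants of the likelihood and the prior must be retained here, which is the complication discussed in the text.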

A complication is that this expression depends on all the constants that appear in the 
likelihood. These constants can be neglected when evaluating the posterior, but they 
are required for the computation of the Bayes factor. If the integral in (13.54) does
not have an explicit solution, it may need to be evaluated numerically using, for example,
importance sampling. There is a substantial literature, reviewed in Kass and Raftery
(1995), on the computation of the Bayes factor that we will not pursue here. We note 
that there are some asymptotic approximations to the Bayes factors that are readily 
computable using output from packages that maximize likelihoods.


Table 13.4. Interpretation of Bayes Factors

Bayes Factor B12    2 ln(B12)    Evidence against H2
1 to 3              0 to 2       weak
3 to 20             2 to 6       positive
20 to 150           6 to 10      strong
>150                >10          very strong


Interpretation of the Bayes factor is in terms of evidence against H2. “The Bayes
factor is a summary of the evidence provided by the data in favor of one scientific 
theory, represented by a statistical model, as opposed to another” (Kass and Raftery, 
1995, p. 777). In the frequentist analysis twice the log-likelihood ratio is an often-used 
quantity. Similarly, twice the log of the Bayes factor is a criterion used in evaluating 
the evidence. Kass and Raftery present the following categorization of the strength of 
evidence against the null model that they have found useful in their own work; see 
Table 13.4. 
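The categorization in Table 13.4 is easy to apply mechanically. The sketch below maps a Bayes factor of at least one to the table's labels and also reports 2 ln(B12); the function name and the handling of boundary values are assumptions, since the table does not say to which category a boundary value belongs.

```python
import math


def evidence_category(bayes_factor):
    """Return (2 ln B, label) using the cutoffs in the first column of Table 13.4."""
    if bayes_factor < 1.0:
        raise ValueError("Bayes factor below one: invert it and swap the hypotheses.")
    two_ln_b = 2.0 * math.log(bayes_factor)
    if bayes_factor <= 3.0:
        label = "weak"
    elif bayes_factor <= 20.0:
        label = "positive"
    elif bayes_factor <= 150.0:
        label = "strong"
    else:
        label = "very strong"
    return two_ln_b, label


for b in (2.0, 10.0, 50.0, 200.0):
    print(b, evidence_category(b))
```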

Suppose that two models under comparison are nested. Denote by H0 the constrained
model and H1 the model that is unconstrained. A pairwise comparison of the
two models using the posterior odds ratio requires computation of the Bayes factor, as
shown earlier. The Bayes factor for the null hypothesis model is defined as


$$B_{01} = \frac{m(y|H_0)}{m(y|H_1)},$$


where m(y|H_j) is the marginal likelihood of the model specification H_j. If the models
H0 and H1 are nested, then the Savage-Dickey density ratio approach (see Verdinelli
and Wasserman, 1995) can be taken to calculate the Bayes factors.

An important insight due to Chib (1995) has made the computation of Bayes factors
a great deal easier than suggested by the earlier literature, irrespective of whether the
models are nested or nonnested. His approach consists of two related ideas. The first
rewrites the marginal density m(y), for a given model H_k, as a ratio

$$m(y) = \frac{f(y|\theta)\,\pi(\theta)}{\pi(\theta|y)}, \tag{13.54}$$

where the numerator is the product of the density (inclusive of constants) and the prior,
and the denominator is the posterior density of θ. This result is a rearrangement of the
terms in equation (13.1), with the qualification that we have used the notation m(y) in
place of f(y) or Pr[y|H_k] used earlier; it merely states that the marginal density is the
normalizing constant. Second, after a successful application of an MCMC algorithm,
we will have available a Monte Carlo estimate of the posterior density π(θ|y)
at a given point θ. Then it follows that


$$\ln m(y) = \ln f(y|\theta) + \ln \pi(\theta) - \ln \pi(\theta|y). \tag{13.55}$$


Therefore, given estimates of the terms on the right-hand side, the marginal density can
be readily computed using the output from a Gibbs sampler. This approach has been
extended in Chib and Jeliazkov (2001) to the case where the output is instead from a
Metropolis-Hastings algorithm.
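The identity underlying this approach can be checked in a model where every term is available in closed form. The sketch below uses a Bernoulli likelihood with a Beta(a, b) prior, so the posterior is Beta(a + s, b + n − s) and no MCMC estimate of the posterior ordinate is needed; in practice that ordinate would be estimated from sampler output. The model, data, and function names are assumptions for illustration only.

```python
from math import lgamma, log


def log_beta_fn(a, b):
    """Log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_beta_pdf(theta, a, b):
    """Log density of the Beta(a, b) distribution at theta."""
    return (a - 1.0) * log(theta) + (b - 1.0) * log(1.0 - theta) - log_beta_fn(a, b)


def log_marginal_via_identity(theta, s, n, a=1.0, b=1.0):
    """ln m(y) = ln f(y|theta) + ln pi(theta) - ln pi(theta|y), at any point theta."""
    log_lik = s * log(theta) + (n - s) * log(1.0 - theta)
    log_prior = log_beta_pdf(theta, a, b)
    log_posterior = log_beta_pdf(theta, a + s, b + n - s)   # conjugate posterior
    return log_lik + log_prior - log_posterior


def log_marginal_direct(s, n, a=1.0, b=1.0):
    """Direct integration of the likelihood against the prior."""
    return log_beta_fn(a + s, b + n - s) - log_beta_fn(a, b)


s, n = 7, 10
for theta in (0.3, 0.5, 0.7):
    # The identity gives the same value at every evaluation point.
    print(theta, log_marginal_via_identity(theta, s, n), log_marginal_direct(s, n))
```

Although the identity holds at any θ, with an estimated posterior ordinate a point of high posterior density is the natural choice because the ordinate is estimated most accurately there.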

In complex and highly parameterized models, the computation of the Bayes factor 
is a nontrivial matter. However, it can be shown that the Schwarz criterion, also known 
as the Bayes information criterion (see Section 8.5), gives a rough approximation to 
the log of the Bayes factor. Recall that BIC = −2 ln L(θ̂) + (ln N)q. This is easy to
compute if the value of the log-likelihood is available. 
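A rough version of this approximation can be sketched as follows. With BIC defined as minus twice the log-likelihood plus (ln N) times the number of parameters q, twice the log Bayes factor B12 is approximated by BIC for model 2 minus BIC for model 1. The function names and the numbers in the example are hypothetical.

```python
import math


def bic(loglik, q, n):
    """BIC = -2 ln L + (ln N) q, with q parameters and N observations."""
    return -2.0 * loglik + math.log(n) * q


def approx_two_ln_bayes_factor(bic_1, bic_2):
    """Rough approximation: 2 ln B12 is about BIC_2 - BIC_1."""
    return bic_2 - bic_1


# Hypothetical fitted log-likelihoods for two models on N = 100 observations.
bic_1 = bic(-100.0, q=2, n=100)
bic_2 = bic(-105.0, q=2, n=100)
print(approx_two_ln_bayes_factor(bic_1, bic_2))
```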

From (13.52) it is obvious that the ratio of prior probabilities of the model plays a 
role in evaluating the evidence against the null. In many situations, the investigator may 
have little to go on in assigning these probabilities. This consideration has received 
some attention in the literature that deals with the sensitivity of the Bayes factor to the 
prior model probabilities. 


13.9. Practical Considerations 


The use of Markov chain methods has now become dominant in the Bayesian lit- 
erature. Because the methods are computer intensive, good software is essential. At 
the time of writing, the WinBUGS package, a later version of the BUGS (Bayesian 
inference Using Gibbs Sampling) package (Gilks et al., 1996), has been widely rec- 
ommended and found to be especially useful for hierarchical models and missing data 
problems. It is available at the BUGS Web site. More detailed information about other 
Bayesian software can be found in Gamerman (1997, Section 5.6). 

The issue of how long to run the chain continues to be an active area of research. Di- 
agnostic checks for convergence are available and have been mentioned, but they often 
do not have universal applicability. Cappé and Robert (2000) provide a review of the 
issues of implementation including stopping rules. The complexity of the conditional 
distributions is clearly an important factor. Graphs of output for scalar parameters from
the Markov chain are a visually attractive way of confirming convergence, but more formal
approaches are available (Geweke, 1992). Another suggestion, due to Gelman and 
Rubin (1992), is to use multiple (parallel) Gibbs samplers, each beginning with differ- 
ent starting values to see if different chains converge to the same posterior distribution. 
Zellner and Min (1995) propose several convergence criteria that can be used if the 
posterior can be written explicitly. 
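The multiple-chain idea is often summarized by a potential scale reduction factor that compares between-chain and within-chain variation for a scalar parameter; values close to one suggest that the chains have reached a common distribution. The sketch below is a basic version of that statistic written from the standard formula, not code from any package, and the simulated chains are made up.

```python
import numpy as np


def potential_scale_reduction(chains):
    """Basic Gelman-Rubin statistic for an (m chains) x (n draws) array."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    within = chains.var(axis=1, ddof=1).mean()          # W: mean within-chain variance
    between = n * chains.mean(axis=1).var(ddof=1)       # B: n x variance of chain means
    pooled = (n - 1.0) / n * within + between / n       # pooled variance estimate
    return np.sqrt(pooled / within)


rng = np.random.default_rng(0)
mixed = rng.standard_normal((4, 1000))                  # four chains, same distribution
stuck = mixed + np.array([0.0, 0.0, 5.0, 5.0])[:, None] # two chains shifted away
print(potential_scale_reduction(mixed), potential_scale_reduction(stuck))
```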


13.10. Bibliographic Notes 


There are several excellent book-length treatments with emphasis on modern computational 
methods for Bayesian analysis, including those by Gamerman (1997) and Gelman et al. (1995). 
Relatively accessible treatments are provided by Gill (2002), Koop (2003), and Lancaster 
(2004). Koop presents Bayesian methods for many standard nonlinear cross-section models and 
for panel data. The older texts by Zellner (1971) and Leamer (1978) are still valuable sources 
of results. 


13.2 Stigler (1986) provides a good exposition of the work of Bayes (1764). Bayes first
presented some properties of probability, notably Pr[A|B] = Pr[A ∩ B]/Pr[B]. Bayes then
applied this result to obtaining the posterior probability Pr[a < θ < b|y], where a and
b are specified bounds, y is the number of successes in N binomial trials, and θ is the
unknown probability of success in each trial. Bayes chose a uniform prior, in which case
the posterior density f(θ|y) ∝ f(y|θ). Bayes' example was challenging as he could not
accurately calculate the posterior probability, which involved the incomplete beta function,
not tabulated until the 20th century. Bayes' paper was initially neglected. A more commonly
used approach due to Laplace and others was the method of inverse probability that also let
f(θ|y) ∝ f(y|θ). These methods were supplanted by maximum likelihood, introduced by
Fisher (1922), whose paper directly critiqued Bayesian and inverse-probability methods.

The regularity conditions for convergence to posterior normality are discussed in Heyde 
and Johnstone (1979). Train (2003) provides an excellent but less formal treatment of the 
so-called Bernstein–von Mises Theorem.

13.3 Zellner (1971) and Leamer (1978) are excellent sources for Bayesian analysis of linear 
regression. 

13.4 Geweke (1989) and Geweke and Keane (2001) are valuable references on Monte Carlo 
integration. 

13.5 Casella and George (1992) provide an expository treatment of the Gibbs sampler. Nu- 
merous papers by Chib and his collaborators and Geweke and his collaborators cover 
many topics of interest in microeconometrics. Chib and Greenberg (1996, Section 3) pro- 
vide a number of applications of MCMC, including the seemingly unrelated regression 
model and the Tobit and probit models. In the latter case they show the computational 
simplification that results from combining Gibbs sampling with the data augmentation 
approach. Data augmentation is used to handle latent variables that are introduced to 
deal with the underlying unobservables that arise naturally in many censored and dis- 
crete choice models. Chib (2001) provides a detailed and up-to-date survey that includes 
MCMC algorithms for many leading linear and nonlinear regression models. Geweke and 
Keane (2000) concentrate on the methods of integration; they cover both Bayesian and 
non-Bayesian topics. 


Exercises 


21-1 Show that if θ|λ ~ N[μ, λ⁻¹Σ], and λ ~ Gamma[α/2, α/2], then the unconditional
distribution of θ is a multivariate t-distribution with parameters (μ, Σ, α).


21-2 (Adapted from Chib, 1992). Consider the censored regression or Tobit model
(see Section 16.3) where y* = x′β + ε, ε ~ iid N[0, σ²], and y is observed when
y* > 0 but is not observed (censored) when y* ≤ 0. There are N0 censored
observations on y, and y refers to them. Introduce a latent variable z that
corresponds to the censored observations such that z_i < 0 if the ith observation
belongs to the censored set. The data augmentation method can be used to
draw latent variables z, a set of independent random variables distributed as
truncated normal, with support (−∞, 0) and pdf φ(z|x_i′β, σ²)/(1 − Φ(x_i′β/σ)),
−∞ < z < 0, where φ and Φ are, respectively, the normal pdf and cdf. Use a
normal prior for β and a gamma prior for σ⁻².

(a) Show that it is possible to specify a full set of conditionals for z, β, and σ⁻².

(b) Use the results of part (a) to outline the Gibbs algorithm for simulating z, β,
and σ⁻².

(c) Explain how suitable initial values of β and σ⁻² may be obtained.


PART FOUR


Models for Cross-Section 
Data 


Part 4, consisting of chapters 14 to 20, covers the core nonlinear limited dependent 
variable models for cross-section data, defined by the range of values taken by the 
dependent variable. Topics covered include models for binary, multinomial, duration 
and count data. The complications of censoring, truncation and sample selection are 
also studied. The essential base for Part 4 is least squares and maximum likelihood 
estimation. 

Chapters 14-15 cover models for binary and multinomial data that are standard in 
the analysis of discrete outcomes and discrete choice. Maximum likelihood methods 
are dominant. Different parameterizations for the conditional probabilities in these 
models lead to different models, notably logit and probit models, which are well- 
established. Recent literature has focused on less restrictive modeling with more flex- 
ible functional forms for conditional probabilities and on accommodating individual 
unobserved heterogeneity. These objectives motivate the use of semiparametric meth- 
ods and simulation-based estimation methods. 

Censoring, truncation, or sample selection generate several important classes of 
models that are analyzed in Chapter 16. The long-established Tobit model is central to 
this literature, but its estimation and inference rely on strong distributional assumptions 
to permit consistent estimation. We also examine the newer semiparametric methods 
that rely on weaker assumptions. 

Chapters 17-19 consider duration models in which the focus is on either the de- 
terminants of spell lengths, such as length of an unemployment spell, or on modeling 
the hazard rate of transitions from one initial state to another. The analysis covers 
both discrete and continuous time models, and both parametric and semiparametric 
formulations, including the standard models like the exponential, the Weibull, and 
the proportional hazards model. Chapter 18 covers formulation and interpretation of 
richer models that incorporate unobserved heterogeneity. The relative importance of 
state dependence and unobserved heterogeneity as determinants of the average length 
of spell is a central issue, whose resolution raises fundamental questions about alterna- 
tive modeling approaches. Chapter 19 deals with models with several types of events 
using the competing risks formulation and models of multiple spells.

Chapter 20 covers the analysis of event counts of the kind very common in health
economics. There are many strong connections and parallels between count data mod- 
els and duration models because of their common foundation in stochastic processes. 
We analyze the widely-used Poisson and negative binomial regression models, to- 
gether with important variants such as the two-part or hurdle model, zero-inflated 
models, latent class models, and endogenous regressor models, all of which
accommodate different facets of the event processes.


CHAPTER 14


Binary Outcome Models 


14.1. Introduction 


Discrete outcome or qualitative response models are models for a dependent variable 
that indicates in which one of m mutually exclusive categories the outcome of interest 
falls. Often there is no natural ordering of the categories. For example, categorization 
may be on the occupation of a worker. 

This chapter considers the simplest case of binary outcomes, where there are 
two possible outcomes. Examples include whether or not an individual is employed 
and whether or not a consumer makes a purchase. Binary outcomes are simple 
to model and estimation is usually by maximum likelihood because the distribu- 
tion of the data is necessarily defined by the Bernoulli model. If the probabil- 
ity of one outcome equals p, then the probability of the other outcome must be 
(1 — p). For regression applications the probability p will vary across individuals 
as a function of regressors. The two standard binary outcome models, the logit and 
the probit models, specify different functional forms for this probability as a func- 
tion of regressors. The difference between these estimators is qualitatively simi- 
lar to use of different functional forms for the conditional mean in least-squares 
regression. 

Section 14.2 provides a data example. Section 14.3 presents a summary of 
statistical results for standard models including logit and probit models. In Sec- 
tion 14.4 binary outcome models are presented as arising from an underlying 
latent variable. This formulation is useful as it extends readily to multinomial 
models (see Chapter 15) and models for censored and selected samples (see 
Chapter 16). Section 14.5 details necessary modifications to standard estimation 
methods when one of the outcomes is deliberately oversampled. Aggregation is- 
sues are considered in Section 14.6. Semiparametric methods for binary outcome 
models that place less structure on the model for the probability p are given in 
Section 14.7.


14.2. Binary Outcome Example: Fishing Mode Choice


This section models choice between fishing from a charter boat and fishing from a pier. 
The dependent variable is binary with 


$$y_i = \begin{cases} 1 & \text{if fishing from a charter boat,} \\ 0 & \text{if fishing from a pier,} \end{cases}$$


where the values 1 and 0 are chosen for simplicity. The single explanatory variable is
x_i = lnrelp_i = ln(relp_i), where relp denotes the price of charter fishing relative to the
price of fishing from the pier, so


$$x_i = \text{lnrelp}_i = \ln\left(\text{price}_{\text{charter},i} / \text{price}_{\text{pier},i}\right).$$


The prices of charter boat and pier fishing vary across individuals owing to various 
factors, for example, to differences in access. It is expected that the probability of 
charter boat fishing decreases as its relative price increases. 

The data are summarized in Table 14.1. The sample of 630 individuals is a subset 
of the data described in greater detail in Section 15.2, where four different modes of 
fishing and additional regressors are considered. Charter boat fishing was selected by 
71.7% of the sample. For people choosing to fish from the charter boat, the charter boat 
was on average less expensive than pier fishing, as $75 < $121. For people choosing 
to fish from the pier the reverse was true. So it appears that price has the expected 
effect. 

An OLS regression of y; on x; ignores the discreteness of the dependent variable 
and does not constrain predicted probabilities to be between zero and one. 

A more appropriate model is the logit model (see Section 14.3.4), which specifies


$$p_i = \Pr[y_i = 1|x_i] = \frac{\exp(\beta_1 + \beta_2 x_i)}{1 + \exp(\beta_1 + \beta_2 x_i)},$$


and clearly ensures that 0 < p_i < 1. Maximum likelihood estimation (see Section 14.3.3)
leads to parameter estimates given in the first column of Table 14.2. The
implied marginal effect for the logit model equals


$$\frac{\partial p_i}{\partial x_i} = \frac{\exp(\beta_1 + \beta_2 x_i)}{(1 + \exp(\beta_1 + \beta_2 x_i))^2}\,\beta_2.$$
Table 14.1. Fishing Mode Choice: Data Summary

                              Subsample Averages
                      y = 1       y = 0       All y
Variable              Charter     Pier        Overall
Price charter ($)     75          110         85
Price pier ($)        121         31          95
ln relp               −0.264      1.643       0.275
Sample probability    0.717       0.283       1.000
Observations          452         178         630


Table 14.2. Fishing Mode Choice: Logit and Probit Estimates^a

                          Model
Regressor     Logit       Probit      OLS
Constant      2.053       1.194       0.784
              (12.15)     (13.34)     (65.58)
ln relp       −1.823      −1.056      −0.243
              (−12.61)    (−13.87)    (−28.15)
ln L          −206.83     −204.41     -
Pseudo R²     0.449       0.455       0.463

^a Dependent variable y = 1 if charter boat fishing and y = 0 if pier fishing. Regressor
x = ln relp, the natural logarithm of the price of charter boat fishing relative to pier
fishing. Intercept and slope parameter estimates with t-statistics in parentheses are
from ML estimation of logit and probit models and from OLS estimation.


Since the logit estimate β̂2 < 0 it follows that ∂p_i/∂x_i < 0, as expected. The actual
magnitude of the marginal effect varies with the point of evaluation x_i (see Section 14.3.2).
An approximation for the logit model, though not other models, is that
∂p_i/∂x_i ≈ ȳ(1 − ȳ)β̂2 = −0.370. An OLS regression instead provides a direct estimate of −0.243.
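The arithmetic behind these numbers can be reproduced from the reported logit estimates in Table 14.2 and the sample frequency in Table 14.1. The sketch below is illustrative only; the function names are hypothetical, and the inputs are the rounded published values rather than the underlying data.

```python
import math

# Logit estimates from Table 14.2 and sample frequency from Table 14.1.
b1_logit, b2_logit = 2.053, -1.823
y_bar = 0.717


def logit_prob(x):
    """Predicted probability of charter boat fishing at log relative price x."""
    return 1.0 / (1.0 + math.exp(-(b1_logit + b2_logit * x)))


def logit_marginal_effect(x):
    """dp/dx = p(1 - p) * beta_2 for the logit model."""
    p = logit_prob(x)
    return p * (1.0 - p) * b2_logit


# Approximation that evaluates p(1 - p) at the sample frequency of y.
approx_effect = y_bar * (1.0 - y_bar) * b2_logit
print(round(approx_effect, 3))          # -0.37, matching the value in the text

# Marginal effect at the overall sample mean of ln relp (0.275 in Table 14.1).
print(logit_marginal_effect(0.275))
```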

An alternative model is the probit model (see Section 14.3.5), which specifies 


$$p_i = \Pr[y_i = 1|x_i] = \Phi(\beta_1 + \beta_2 x_i),$$


where Φ(·) is the cumulative distribution function for the standard normal, so
$p_i = \int_{-\infty}^{\beta_1 + \beta_2 x_i} (2\pi)^{-1/2} e^{-z^2/2}\, dz$. The ML coefficients are
given in the second column of Table 14.2 and differ appreciably from the logit coefficients.
Since different specifications are being estimated the coefficients are not comparable. This
is similar to our inability to compare coefficients in models with conditional mean x′β and
exp(x′β). For the probit model ∂p_i/∂x_i = φ(β1 + β2x_i)β2, where φ(·) is the density for the
standard normal. So again ∂p_i/∂x_i < 0 since the probit estimate β̂2 < 0.

Although the slope coefficients necessarily differ across the models, from Ta- 
ble 14.2 the t-statistics are similar and are all very high. The log-likelihood for 
the probit model is 2.42 higher than that for the logit, favoring the probit model 
since both models use the same number of parameters. In many other examples there 
is little difference in ln L across the models. The predicted probabilities from the
three models are plotted as a function of x in Figure 14.1. In OLS we assume that
Pr[y_i = 1|x_i] = β1 + β2x_i is linear in x_i, whereas the nonlinear functions for logit and
probit are essentially equivalent.


14.3. Logit and Probit Models 


We now provide more formal theory for these models. We present binary outcomes 
as a direct extension of the coin-toss example of introductory statistics to situations 
where the probability of success is modeled to depend on regressors. Two commonly 
used parameterizations lead to the logit and probit models. Motivation for these
parameterizations, using latent variables, is deferred to Section 14.4.


[Figure 14.1 plot omitted. Panel title: Predicted Probabilities Across Models. Vertical axis:
predicted probability. Horizontal axis: log relative price (lnrelp). Legend entries recovered:
Actual Data (jittered), Logit.]

Figure 14.1: Charter boat fishing: predicted probability from logit and probit models and
OLS prediction when the single regressor is the natural logarithm of relative price. Actual
outcomes of 1 or 0 are also plotted after jittering for readability. Data for 630 individuals.


14.3.1. General Binary Outcome Model 


For binary outcome data the dependent variable y takes one of two values. We let 


$$y = \begin{cases} 1 & \text{with probability } p, \\ 0 & \text{with probability } 1 - p. \end{cases}$$


There is no loss of generality in setting the values to 1 and 0 if all that is being modeled
is p, which determines the probability of the outcome. In introductory statistics this 
model describes the outcome of a coin toss where heads leads to y = 1 and occurs 
with probability p. 

A regression model is formed by parameterizing the probability p to depend on a
regressor vector x and a K × 1 parameter vector β. The commonly used models are
of single-index form with conditional probability given by


$$p_i \equiv \Pr[y_i = 1|x] = F(x_i'\beta), \tag{14.1}$$


where F(·) is a specified function. To ensure that 0 ≤ p ≤ 1 it is natural to specify
F(·) to be a cumulative distribution function.

Table 14.3 presents the most commonly used binary outcome models. The logit
model arises if F(·) is the cdf of the logistic distribution and the probit model arises
if F(·) is the standard normal cdf. Note that if F(·) is a cdf, then this cdf is only
being used to model the parameter p and does not denote the cdf of y itself. The
less-used complementary log-log model arises if F(·) is the cdf of the extreme value
distribution. It differs from the other models in being asymmetric around zero and is
used when one of the outcomes is rare. The linear probability model does not use a
cdf and instead lets p_i = x_i′β.


Table 14.3. Binary Outcome Data: Commonly Used Models

Model                    Probability (p = Pr[y = 1|x])          Marginal Effect (∂p/∂x_j)
Logit                    Λ(x′β) = e^{x′β}/(1 + e^{x′β})         Λ(x′β)[1 − Λ(x′β)]β_j
Probit                   Φ(x′β) = ∫_{−∞}^{x′β} φ(z)dz           φ(x′β)β_j
Complementary log-log    C(x′β) = 1 − exp(−exp(x′β))            exp(−exp(x′β)) exp(x′β)β_j
Linear probability       x′β                                    β_j


14.3.2. Marginal Effects 


Interest lies in determining the marginal effect of change in a regressor on the
conditional probability that y = 1. For general probability model (14.1) and change in the
jth regressor, assumed to be continuous, this is


$$\frac{\partial \Pr[y_i = 1|x_i]}{\partial x_{ij}} = F'(x_i'\beta)\,\beta_j, \tag{14.2}$$

where F′(z) = ∂F(z)/∂z. The marginal effects differ with the point of evaluation x_i,
as for any nonlinear model, and differ with different choices of F(·). The last column
of Table 14.3 gives the marginal effects for the common binary outcome models.
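The probability functions and marginal effects of Table 14.3 can be written down directly and checked against a numerical derivative. The sketch below is illustrative; the dictionary `MODELS` and the function `marginal_effect` are hypothetical names, and the index value and coefficient used in the example are arbitrary.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# For each model: (F(z), F'(z)) as functions of the index z = x'beta.
MODELS = {
    "logit": (lambda z: 1.0 / (1.0 + math.exp(-z)),
              lambda z: math.exp(-z) / (1.0 + math.exp(-z)) ** 2),
    "probit": (norm_cdf, norm_pdf),
    "cloglog": (lambda z: 1.0 - math.exp(-math.exp(z)),
                lambda z: math.exp(-math.exp(z)) * math.exp(z)),
    "linear": (lambda z: z, lambda z: 1.0),
}


def marginal_effect(model, index, beta_j):
    """Marginal effect F'(x'beta) * beta_j of the jth (continuous) regressor."""
    _, dF = MODELS[model]
    return dF(index) * beta_j


index, beta_j = 0.3, -1.2      # arbitrary illustrative values
for name in MODELS:
    print(name, marginal_effect(name, index, beta_j))
```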

Marginal effects for nonlinear models are discussed in Section 5.2.4. Given a
specific model there are several ways to compute an average marginal effect. It is best to
use N⁻¹ Σ_i F′(x_i′β̂)β̂_j, the sample average of the marginal effects. Some programs
instead evaluate at the sample average of the regressors, F′(x̄′β̂)β̂_j. An easily
constructed measure evaluates at ȳ, the sample average of y, so that F(x′β̂) = ȳ and
F′(x′β̂) = F′(F⁻¹(ȳ)). This is especially simple for the logit model as then this yields
estimated marginal effect ȳ(1 − ȳ)β̂_j. Further discussion for specific models is given
in Sections 14.3.4-14.3.7.
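The three summary measures can be compared in a short sketch for the logit model. Everything here is illustrative: the regressors are simulated, `beta_hat` holds hypothetical coefficient values rather than actual estimates, and the average predicted probability stands in for ȳ (the two coincide at the logit MLE when an intercept is included).

```python
import numpy as np


def logistic(z):
    """Logistic cdf."""
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
N = 1000
X = np.column_stack([np.ones(N), rng.standard_normal(N)])
beta_hat = np.array([1.0, -1.5])     # hypothetical coefficient values
j = 1                                # regressor of interest

p_hat = logistic(X @ beta_hat)

# (1) Sample average of the individual marginal effects.
ame = np.mean(p_hat * (1.0 - p_hat)) * beta_hat[j]

# (2) Marginal effect evaluated at the sample average of the regressors.
p_at_mean = logistic(X.mean(axis=0) @ beta_hat)
me_at_mean = p_at_mean * (1.0 - p_at_mean) * beta_hat[j]

# (3) Marginal effect evaluated at y_bar (here the mean predicted probability).
y_bar = p_hat.mean()
me_at_ybar = y_bar * (1.0 - y_bar) * beta_hat[j]

print(ame, me_at_mean, me_at_ybar)
```

Because p(1 − p) is concave in p, the measure evaluated at ȳ is never smaller in absolute value than the sample average of the marginal effects.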

Many studies instead report only the regression coefficients. The standard binary 
outcome models are single-index models, so the ratio of coefficients for two different 
regressors equals the ratio of the marginal effects. The sign of the coefficient gives 
the sign of the marginal effect, since F′(·) > 0. The coefficients can be used to obtain
an upper bound on the marginal effects. For the logit model ∂p/∂x_j ≤ 0.25β_j, since
Λ(x′β)(1 − Λ(x′β)) ≤ 0.25, with maximum when Λ(x′β) = 0.5 and x′β = 0. For the
probit model ∂p/∂x_j ≤ 0.4β_j since φ(x′β) ≤ 1/√(2π) ≈ 0.4, with maximum when
Φ(x′β) = 0.5 and x′β = 0.


14.3.3. ML Estimation 


We consider estimation given a sample (y_i, x_i), i = 1,..., N, where we assume
independence over i. Results are given for p_i defined in (14.1), with specialization to logit
and probit specifications given later.


MLE for General Binary Outcome Models


The outcome is Bernoulli distributed, the binomial distribution with just one trial. A 
very convenient compact notation for the density of y;, or more formally its probabil- 
ity mass function, is 


$$f(y_i|x_i) = p_i^{y_i}(1 - p_i)^{1 - y_i}, \quad y_i = 0, 1, \tag{14.3}$$


where p_i = F(x_i′β). This yields probabilities p_i and (1 − p_i) since f(1) = p¹(1 − p)⁰ = p
and f(0) = p⁰(1 − p)¹ = 1 − p.

The density (14.3) implies log density ln f(y_i) = y_i ln p_i + (1 − y_i) ln(1 − p_i).
Given independence over i and model (14.1) for p_i, the log-likelihood function is


$$\mathcal{L}_N(\beta) = \sum_{i=1}^{N} \left\{ y_i \ln F(x_i'\beta) + (1 - y_i) \ln(1 - F(x_i'\beta)) \right\}. \tag{14.4}$$


Differentiating with respect to β, we have that the MLE β̂_ML solves


$$\sum_{i=1}^{N} \left\{ \frac{y_i}{F_i} F_i' x_i - \frac{1 - y_i}{1 - F_i} F_i' x_i \right\} = 0,$$


where F_i = F(x_i′β), F_i′ = F′(x_i′β), and F′(z) = ∂F(z)/∂z. Converting to fractions
with common denominator F_i(1 − F_i) and simplifying yields the ML first-order conditions


$$\sum_{i=1}^{N} \frac{y_i - F(x_i'\beta)}{F(x_i'\beta)(1 - F(x_i'\beta))}\, F'(x_i'\beta)\, x_i = 0. \tag{14.5}$$


There is no explicit solution for β̂_ML, but the Newton-Raphson iterative procedure
usually converges very quickly since for the probit and logit models, at least, the log- 
likelihood is globally concave. 
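A minimal Newton-Raphson sketch for the logit case follows. For the logit model F′ = F(1 − F), so the first-order conditions reduce to the sum over i of (y_i − Λ(x_i′β))x_i = 0 and the Hessian is minus the sum of Λ_i(1 − Λ_i)x_i x_i′. The function name, the starting value of zero, the stopping rule, and the simulated data are all assumptions for illustration.

```python
import numpy as np


def logit_newton(y, X, tol=1e-10, max_iter=50):
    """Newton-Raphson for the logit MLE: beta_new = beta - H^{-1} g."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        score = X.T @ (y - p)                               # gradient g
        hessian = -(X * (p * (1.0 - p))[:, None]).T @ X     # Hessian H (negative definite)
        step = np.linalg.solve(hessian, score)
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Illustrative simulated data (all values are made up).
rng = np.random.default_rng(0)
N = 2000
X = np.column_stack([np.ones(N), rng.standard_normal(N)])
beta_true = np.array([0.5, -1.0])
p_true = 1.0 / (1.0 + np.exp(-(X @ beta_true)))
y = (rng.uniform(size=N) < p_true).astype(float)

beta_hat = logit_newton(y, X)
p_hat = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
score_at_mle = X.T @ (y - p_hat)
print(beta_hat, score_at_mle)
```

Since the design matrix includes an intercept, the first element of the score being zero means that the average predicted probability equals the sample frequency of y.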


Consistency of the MLE 


The MLE is consistent if the conditional density of y given x is correctly specified. 
Since the density here must be the Bernoulli, the only possible misspecification is that 
the Bernoulli probability is misspecified. So the MLE is consistent if p_i = F(x_i′β) and
is inconsistent otherwise.

More formally, note that for binary data, E[y] = 1 x p +0 x (1 — p) = p. Given 
(14.1) this implies 


$$E[y_i|x_i] = F(x_i'\beta), \tag{14.6}$$


which in turn implies that the left-hand side of the first-order equations (14.5) has
expected value zero, the essential condition for consistency. This special result of
consistency provided the conditional mean is correctly specified holds for LEF densities
(see Section 5.7.3) and the Bernoulli is an LEF density.


Distribution of the MLE


Given correct specification of the density, β̂_ML is asymptotically distributed as
N[β, (−E[∂²L_N/∂β∂β′])⁻¹] (see Section 5.6.4). Differentiating (14.4) with respect to β′,
and taking minus the expected value yields the estimated asymptotic variance matrix


$$\hat{V}[\hat{\beta}_{ML}] = \left( \sum_{i=1}^{N} \frac{1}{F(x_i'\hat{\beta})(1 - F(x_i'\hat{\beta}))}\, F'(x_i'\hat{\beta})^2\, x_i x_i' \right)^{-1}, \tag{14.7}$$


where simplification occurs because E[y_i − F(x_i′β)] = 0. This variance matrix is of
the simple form (Σ_i ŵ_i x_i x_i′)⁻¹, where the weights ŵ_i are given in (14.7).

Since consistency requires only correct specification of the conditional mean or
probability, it is natural to consider the quasi-MLE (see Section 5.7) and base
inference on the sandwich form of the variance matrix A⁻¹BA⁻¹ rather than −A⁻¹ used in
(14.7). Here


$$V[y_i|x_i] = F(x_i'\beta)(1 - F(x_i'\beta)), \tag{14.8}$$


since V[y] = (1 − p)² × p + (0 − p)² × (1 − p) = p(1 − p). Some algebra shows
that this implies that A = −B and hence A⁻¹BA⁻¹ = −A⁻¹, assuming independence
over i. The only way that (14.8) does not hold is if p ≠ F(x′β) in which case the MLE
would suffer from the more fundamental problem of inconsistency.

Binary outcome models are unusual in that there is no advantage in using the sand- 
wich form if data are independent over i. The only reason for moving to a robust vari- 
ance matrix estimate is if observations are correlated over i as the result of clustering. 
Then the robust estimate needs to be one that is robust to clustering (see Section 24.5) 
rather than to misspecification of the conditional variance. 


14.3.4. Logit Model 


The logit model or logistic regression model specifies 


$$p = \Lambda(x'\beta) = \frac{e^{x'\beta}}{1 + e^{x'\beta}}, \tag{14.9}$$


where Λ(·) is the logistic cdf (see Section 14.4.1 for further details), with
Λ(z) = e^z/(1 + e^z) = 1/(1 + e^{−z}).

The logit MLE first-order conditions (14.5) simplify to


$$\sum_{i=1}^{N} (y_i - \Lambda(x_i'\beta))\, x_i = 0, \tag{14.10}$$


since Λ′(z) = Λ(z)[1 − Λ(z)]. So the raw residual y_i − Λ(x_i′β̂) is orthogonal to the
regressors, similar to OLS regression. This simple form arises because Λ(·) is the
canonical link function (see Section 5.7.4) for the Bernoulli density.

If the regressors $x_i$ include an intercept, then (14.10) implies that
$\sum_i (y_i - \Lambda(x_i'\hat\beta)) = 0$, so the logit residuals sum to zero. This implies that the average
in-sample predicted probability $N^{-1}\sum_i \Lambda(x_i'\hat\beta)$ necessarily equals the sample frequency $\bar y$.

The marginal effects for the logit model can be fairly easily obtained from the
coefficients, since $\partial p_i/\partial x_{ij} = p_i(1 - p_i)\beta_j$, where $p_i = \Lambda_i = \Lambda(x_i'\beta)$. Evaluating at
$p_i = \bar y$ yields a crude estimated marginal effect of $\bar y(1 - \bar y)\hat\beta_j$. For $0.3 < p_i < 0.7$,
for example, $\partial p_i/\partial x_{ij}$ lies between $0.21\beta_j$ and $0.25\beta_j$. For data where $p_i \simeq 0.0$, in
which case most outcomes are zero, $\partial p_i/\partial x_{ij} \simeq p_i\beta_j$ so $\beta_j$ gives the proportionate
effect on the probability that $y_i = 1$ as $x_{ij}$ changes.
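
The properties of the logit first-order conditions can be checked numerically. The following is a minimal sketch, not a production estimator: it assumes simulated data with hypothetical parameter values, and the helper names `logistic` and `fit_logit` are introduced here for illustration only. It solves (14.10) by Newton-Raphson and then computes the residual sum and two versions of the marginal effect.

```python
import numpy as np

def logistic(z):
    """Logistic cdf Lambda(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logit(X, y, n_iter=25):
    """Logit MLE by Newton-Raphson on the first-order conditions (14.10)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = logistic(X @ beta)
        score = X.T @ (y - p)                          # sum_i (y_i - Lambda_i) x_i
        hessian = -(X * (p * (1 - p))[:, None]).T @ X  # -sum_i Lambda_i (1 - Lambda_i) x_i x_i'
        beta = beta - np.linalg.solve(hessian, score)
    return beta

# Simulated data for illustration only; beta_true is a hypothetical dgp value.
rng = np.random.default_rng(0)
N = 2000
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])                   # intercept plus one regressor
beta_true = np.array([-0.5, 1.0])
y = (rng.uniform(size=N) < logistic(X @ beta_true)).astype(float)

beta_hat = fit_logit(X, y)
p_hat = logistic(X @ beta_hat)

resid_sum = np.sum(y - p_hat)                          # zero up to rounding error
avg_me = np.mean(p_hat * (1 - p_hat)) * beta_hat[1]    # sample-average marginal effect
crude_me = y.mean() * (1 - y.mean()) * beta_hat[1]     # marginal effect evaluated at p = ybar
```

Because the model includes an intercept, `resid_sum` is zero up to rounding and `p_hat.mean()` equals `y.mean()`. The two marginal effects generally differ somewhat, since `crude_me` evaluates $p(1 - p)$ at a single point rather than averaging over the sample.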

In the statistics literature a very common interpretation of the coefficients is in terms 
of marginal effects on the odds ratio rather than on the probability. For the logit model 


$p = \exp(x'\beta)/(1 + \exp(x'\beta))$
$\Rightarrow \frac{p}{1 - p} = \exp(x'\beta)$ (14.11)
$\Rightarrow \ln\frac{p}{1 - p} = x'\beta$.


Here p/(1 — p) measures the probability that y = 1 relative to the probability that 
y = 0 and is called the odds ratio or relative risk. For example, consider a phar- 
maceutical drug study where y = 1 denotes survival and y = 0 denotes death and 
regressors include a measure of drug intake. An odds ratio of 2 means that the odds of 
survival are twice those of death. For the logit model the log-odds ratio is linear in the 
regressors. 

Statistical analyses and packages use the second equality in (14.11). Suppose the
$j$th regressor increases by one unit. Then $\exp(x'\beta)$ increases to $\exp(x'\beta + \beta_j) =
\exp(x'\beta) \times \exp(\beta_j)$. It follows from (14.11) that the odds ratio has increased by a
multiple $\exp(\beta_j)$. Thus a logit model slope parameter of 0.1, for example, means that a
one-unit increase in the regressor multiplies the initial odds ratio by $\exp(0.1) \simeq 1.105$.
This is a proportionate increase of 0.105 times the initial odds ratio, so the relative
probability of survival increases by 10.5%. This interpretation of the logit model is
widely used in biostatistics applications.

For economists it is more natural to interpret either the second or third equality in
(14.11) as implying that $\beta_j$ is a semi-elasticity. Then, taking a calculus approach, we
interpret a logit model slope parameter of 0.1 as meaning that a one-unit increase in
the regressor increases the odds ratio by a multiple 0.1. This coincides approximately with the
interpretation used in statistics for very small $\beta_j$, since then $\exp(\beta_j) - 1 \simeq \beta_j$.
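
The two interpretations can be compared with a few lines of arithmetic. This is an illustrative sketch only: the starting probability `p0` is a hypothetical value, and the helper functions `odds` and `prob_from_odds` are names introduced here.

```python
import math

beta_j = 0.1                                  # slope from the example in the text

odds_multiplier = math.exp(beta_j)            # exact factor applied to p / (1 - p)
pct_exact = 100 * (odds_multiplier - 1)       # exact percentage change in the odds
pct_approx = 100 * beta_j                     # semi-elasticity (calculus) approximation

def odds(p):
    """Odds p / (1 - p) implied by a probability p."""
    return p / (1 - p)

def prob_from_odds(o):
    """Probability implied by odds o."""
    return o / (1 + o)

p0 = 0.5                                      # hypothetical starting probability
p1 = prob_from_odds(odds(p0) * odds_multiplier)   # probability after a one-unit increase
```

The exact change in the odds is about 10.5%, against 10% from the calculus approximation. The change in the probability itself, `p1 - p0`, depends on the starting point `p0`, which is why the odds interpretation and the marginal effect on the probability are not interchangeable.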


14.3.5. Probit Model 


The probit model specifies the conditional probability


$p = \Phi(x'\beta) = \int_{-\infty}^{x'\beta} \phi(z)\,dz$, (14.12)


where $\Phi(\cdot)$ is the standard normal cdf, with derivative $\phi(z) = (1/\sqrt{2\pi})\exp(-z^2/2)$,
which is the standard normal density function.
The probit MLE first-order conditions are that 


$\sum_{i=1}^{N} w_i (y_i - \Phi(x_i'\beta))x_i = 0$,


where, unlike the logit model, the weight $w_i = \phi(x_i'\beta)/[\Phi(x_i'\beta)(1 - \Phi(x_i'\beta))]$ varies
across observations.

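The probit model marginal effects are $\partial p_i/\partial x_{ij} = \phi(x_i'\beta)\beta_j = \phi(\Phi^{-1}(p_i))\beta_j$,
where $p_i = \Phi(x_i'\beta)$. There are no further simplifications similar to those for the logit
model, though $\partial p_i/\partial x_{ij} \leq 0.40\beta_j$ since $\phi(z) \leq \phi(0) = 1/\sqrt{2\pi}$.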

The probit model is not as simple as the logit model. It is nevertheless widely used 
as it is the natural model if the starting point is a latent normal regression model (see 
Section 14.4). 


14.3.6. OLS Estimation 


A simple alternative to either logit or probit is OLS regression of $y$ on $x$. This has
the obvious deficiency that it is possible to obtain predicted probabilities $x'\hat\beta$ that are
negative or that exceed one.

The OLS estimator is nonetheless useful as an exploratory tool. In practice it provides
a reasonable direct estimate of the sample-average marginal effect on the probability
that $y = 1$ as $x$ changes, even though it provides a poor model for individual
probabilities. It also provides a good guide to which variables are statistically
significant. In many applications it turns out that $0 < x'\hat\beta < 1$ for all sample observations,
in which case OLS is more reasonable.

If the OLS estimator is used then standard errors should correct for heteroskedasticity.
Linear regression is justified if the probability $p_i = x_i'\beta$. Then $y_i|x_i$ has mean
$x_i'\beta$ and heteroskedastic variance $x_i'\beta(1 - x_i'\beta)$ that varies with $x_i$.

In theory more efficient ML estimation is possible if $p_i = x_i'\beta$. From (14.5) the ML
first-order conditions are $\sum_i x_i(y_i - x_i'\beta)/[x_i'\beta(1 - x_i'\beta)] = 0$. The estimator can be
numerically unstable as it places very high weight on observations with $x_i'\beta$ close
to 0 or 1. Moreover, the efficiency gains compared to OLS are often small.

Although OLS estimation with heteroskedastic standard errors can be a useful ex- 
ploratory data analysis tool, it is best to use the logit or probit MLE for final data 
analysis. 


14.3.7. Choosing a Binary Model 


Which model should be used — logit or probit? This question is explored in this section. 


Theoretical Considerations 


Theoretically the answer depends on the dgp, which is unknown. Unlike other applications
of ML there is no problem in specifying the distribution: the only possible
distribution for a (0, 1) variable is the Bernoulli. The problem lies in specifying a
functional form for the parameter of this distribution. If the dgp has $p = \Lambda(x'\beta)$ then
a logit model should be used, and estimators based on other models such as probit
are potentially inconsistent. Similar qualitative conclusions hold if instead the dgp has
$p = \Phi(x'\beta)$, in which case the probit model should be used. It is very unlikely that
$p = x'\beta$ since then $p$ is not restricted to be between 0 and 1.

The theoretical consequences of model misspecification, however, are not as great
as this. If the regressors are distributed such that the mean of each regressor, conditional
on the linear combination $x'\beta$, is linear in $x'\beta$, then choosing the wrong function
$F$ can be shown to affect all slope parameters equally so that the ratio of slope parameters
is constant across models; see Ruud (1983). This condition is satisfied by the
family of spherical distributions, including the multivariate normal.

The logit model has a relatively simple form for the first-order conditions and 
asymptotic distribution. Berkson (1951), who popularized the logit model, gave this 
as one of several reasons for preferring the logit model to the original probit model. 
Within the framework of generalized linear models, which are widely used in biostatis- 
tics, the logit model is the natural model as it corresponds to use of the canonical link 
for the binomial distribution. The interpretation of coefficients in terms of the log-odds 
ratio is also an attraction of the logit model. 

Yet another motivation for the logit model is discriminant analysis. In discriminant 
analysis both y and x are random variables; x is observed but y is not observed. Given 
x we need to determine whether y equals zero or one. A classic example is classifying 
what type of humanoid (y = 0 or 1) a skull belongs to given various dimensions (x) of 
the skull. If the conditional distributions of the characteristics x given y are multivariate 
normal distributed, the posterior probability of y given x is similar to the probability in 
the logit model. For more details, see Amemiya (1981, pp. 1507-1510) and Maddala 
(1983, pp. 17-21). 

The probit model, in contrast, has the attraction of being motivated by a latent nor- 
mal random variable (see Section 14.4) and extends naturally to Tobit models (see 
Chapter 16). For these reasons many economists use the probit model. 


Empirical Considerations 


Empirically, either logit or probit can be used. There is often little difference between
the predicted probabilities from probit and logit models. The difference is greatest
in the tails where probabilities are close to 0 or 1. The difference is much less if
interest lies only in marginal effects averaged over the sample rather than for each 
individual. 

The natural metric to use to compare models is the fitted log-likelihood, since there 
is agreement that the log-likelihood is correct, given the model for p;, and the logit and 
probit models have the same number of parameters. Thus for each model compute 


$\mathcal{L}_N(\hat\beta) = \sum_i \{y_i \ln \hat p_i + (1 - y_i)\ln(1 - \hat p_i)\}$,


where $\hat p_i = \Lambda(x_i'\hat\beta_{Logit})$ or $\hat p_i = \Phi(x_i'\hat\beta_{Probit})$. Often the fitted log-likelihoods are very
similar for the two models, again suggesting little additional gain to using one rather
than the other model. For more formal nonnested model tests see Pesaran and Pesaran
(1995) and Section 8.5.

The different models do yield quite different estimates $\hat\beta$ of regression parameters.
However, this is just an artifact of using different formulas for the probabilities. It is
more meaningful to compare the marginal effect across models, as this measure is
scaled similarly across the three models. From Section 14.2.3, $\partial p/\partial x_j \leq 0.25\beta_j$ for
logit, $\partial p/\partial x_j \leq 0.4\beta_j$ for probit, and $\partial p/\partial x_j = \beta_j$ for OLS. This suggests the rule
of thumb


$\hat\beta_{Logit} \simeq 4\hat\beta_{OLS}$,
$\hat\beta_{Probit} \simeq 2.5\hat\beta_{OLS}$, (14.13)
$\hat\beta_{Logit} \simeq 1.6\hat\beta_{Probit}$.


Amemiya (1981, p. 1488) demonstrates that these comparisons work quite well for
slope parameters if $0.1 < p < 0.9$. Greater departures across the models occur in
the tails. For logit an alternative method, based on (14.18) given later, uses
$\hat\beta_{Logit} \simeq (\pi/\sqrt{3})\hat\beta_{Probit}$.
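
The two conversion factors between logit and probit coefficients come from different normalizations, which a short calculation makes explicit. The sketch below is illustrative only; the function names are introduced here and are not from the text.

```python
import math

def logistic_density(z):
    """Logistic density Lambda'(z) = e^z / (1 + e^z)^2, written in a numerically stable form."""
    e = math.exp(-abs(z))
    return e / (1 + e) ** 2

def normal_density(z):
    """Standard normal density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

max_logit = logistic_density(0.0)       # largest value of dp/dx per unit of beta, logit
max_probit = normal_density(0.0)        # largest value of dp/dx per unit of beta, probit

ratio_density = max_probit / max_logit  # matches marginal effects at p = 0.5
ratio_sd = math.pi / math.sqrt(3)       # matches the standard deviations of the errors
```

Matching marginal effects at $p = 0.5$ gives a factor near 1.6, while matching the error standard deviations gives $\pi/\sqrt{3} \simeq 1.814$. The rule of thumb in (14.13) corresponds to the first of these.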


Endogenous Regressors 


Logit and probit models can be extended to handle many of the complications that 
commonly arise in microeconometric analysis. In particular, endogenous regressors 
are accommodated using methods similar to those for censored data given in Sec- 
tion 16.8.2, and panel data methods are presented in Chapter 23. 

For such complications it is easier to work with the linear probability model, since 
then standard linear model methods can be applied provided standard errors adjust for 
heteroskedasticity. Even if logit and probit models are ultimately used, a linear model 
can be useful for exploratory analysis. 


14.3.8. Determining Model Adequacy 


Model diagnostics and selection for nonlinear models were presented in Section 8.7. 
Here we consider specialization to binary outcome models. There is no single best 
measure, and statistical packages accordingly report several measures detailed in 
Amemiya (1981) and Maddala (1983). 


Pseudo-$R^2$


A standard measure of goodness of fit in the linear regression model is $R^2$. Generalizations
to nonlinear models are called pseudo-$R^2$, with several generalizations possible.

A preferred measure is the relative gain measure denoted $R^2_{RG}$ in Section 8.7.1. This
measure is not always computable, but it is for the binary outcome model since $Q_{max}$,
the maximum possible value of the log-likelihood, is zero. To obtain this result note
that the best possible fit is clearly a $y^*$ that predicts $y = 1$ using $p = 1$ and
$y = 0$ using $p = 0$, in which case $f(y^*) = 1$ and $\ln f(y^*) = 0$. Then
$R^2_{RG} = 1 - (0 - Q_{fit})/(0 - Q_0) = 1 - Q_{fit}/Q_0$. This yields the $R^2$ measure for binary
outcome models proposed by McFadden (1974):


$R^2_{Binary} = 1 - \frac{\mathcal{L}_N(\hat\beta)}{\mathcal{L}_N(\bar y)}$ (14.14)
$= 1 - \frac{\sum_i (y_i \ln \hat p_i + (1 - y_i)\ln(1 - \hat p_i))}{N[\bar y \ln \bar y + (1 - \bar y)\ln(1 - \bar y)]}$,


where $\hat p_i = F(x_i'\hat\beta)$ and $\bar y = N^{-1}\sum_i y_i$.

Additional $R^2$ measures, many specific to binary data, are given in Amemiya (1981)
and Maddala (1983). An obvious one is the squared sample correlation between $y_i$
and $\hat p_i$. One of these additional measures is also attributed to McFadden, and many
references give this measure rather than the $R^2$ in (14.14).
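
The measure in (14.14) needs only the outcomes and the fitted probabilities. The following sketch assumes a small made-up data set and hypothetical fitted probabilities; the function names `bernoulli_loglik` and `mcfadden_r2` are introduced here for illustration.

```python
import numpy as np

def bernoulli_loglik(y, p):
    """Bernoulli log-likelihood sum_i {y_i ln p_i + (1 - y_i) ln(1 - p_i)}."""
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def mcfadden_r2(y, p_hat):
    """R-squared of (14.14): one minus fitted over intercept-only log-likelihood."""
    ybar = y.mean()
    ll_fit = bernoulli_loglik(y, p_hat)
    ll_null = len(y) * (ybar * np.log(ybar) + (1 - ybar) * np.log(1 - ybar))
    return 1 - ll_fit / ll_null

# Made-up outcomes and hypothetical fitted probabilities, for illustration only.
y = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=float)
p_null = np.full_like(y, y.mean())        # intercept-only model predicts ybar for everyone
p_good = np.where(y == 1, 0.9, 0.1)       # a model that separates the outcomes well

r2_null = mcfadden_r2(y, p_null)          # no gain over the intercept-only model
r2_good = mcfadden_r2(y, p_good)          # between 0 and 1
```

The measure is zero when the fitted probabilities all equal $\bar y$ and approaches one as the fitted probabilities approach the observed outcomes. It is undefined if $\bar y$ equals 0 or 1, a case the sketch does not guard against.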


Predicted Outcomes 


In the linear regression model goodness of fit is often evaluated by comparison of
fitted and actual values. For binary data the fitted value $\hat y$ should be binary since $y$
is binary. The criterion $\sum_i (y_i - \hat y_i)^2$ gives the number of wrong predictions, which
arise if $(y, \hat y)$ equals $(1, 0)$ or $(0, 1)$. An obvious prediction rule is to set $\hat y = 1$ when
$\hat p = F(x'\hat\beta) > 0.5$. However, this has the weakness that if most of the sample has
$y = 1$ then often $\sum_i (y_i - \hat y_i)^2 = N(1 - \bar y)$ since it is likely that $\hat p > 0.5$ and hence
$\hat y = 1$ for all the observations. Similar problems arise if most of the sample has $y = 0$.

More generally, a range of cutoff values may be considered. Letting $\hat y = 1$ when
$\hat p > c$, we obtain the receiver operating characteristics (ROC) curve, which plots
the fraction of $y = 1$ values correctly classified against the fraction of $y = 0$ values
incorrectly classified as the cutoff $c$ varies. For $c = 1$ no values are predicted to be 1,
so no $y = 1$ values are correctly classified and no $y = 0$ values are incorrectly classified,
and the ROC curve takes value $(0, 0)$. Similarly, for $c = 0$ all values are predicted to be 1,
so all $y = 1$ values are correctly classified and all $y = 0$ values are incorrectly classified,
and the ROC curve takes value $(1, 1)$.

If the model has no predictive ability the ROC curve is a straight line between these 
points. The more bowed the curve, and the more area under it, the better the predictive 
power of the model. 
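
The construction of the ROC curve can be sketched directly from its definition. The data and fitted probabilities below are made up for illustration, and `roc_points` is a name introduced here. The cutoffs are chosen to fall between the distinct fitted probabilities so that no comparison is affected by ties.

```python
import numpy as np

def roc_points(y, p_hat, cutoffs):
    """For each cutoff c, set y_hat = 1 when p_hat > c and return the pair
    (fraction of y = 0 wrongly classified, fraction of y = 1 correctly classified)."""
    y = np.asarray(y)
    p_hat = np.asarray(p_hat)
    points = []
    for c in cutoffs:
        y_hat = p_hat > c
        true_pos_rate = np.mean(y_hat[y == 1])
        false_pos_rate = np.mean(y_hat[y == 0])
        points.append((false_pos_rate, true_pos_rate))
    return np.array(points)

# Made-up outcomes and fitted probabilities, for illustration only.
y = np.array([0, 0, 1, 0, 1, 1, 0, 1])
p_hat = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])
cutoffs = np.array([1.0, 0.85, 0.75, 0.65, 0.5, 0.375, 0.325, 0.2, 0.0])

pts = roc_points(y, p_hat, cutoffs)

# Area under the curve by the trapezoid rule over the plotted points.
auc = np.sum(np.diff(pts[:, 0]) * (pts[1:, 1] + pts[:-1, 1]) / 2)

# Number of wrong predictions under the single cutoff c = 0.5.
wrong_at_half = int(np.sum((p_hat > 0.5) != (y == 1)))
```

The first point is $(0, 0)$ at $c = 1$ and the last is $(1, 1)$ at $c = 0$. An area of 0.5 corresponds to the straight line of a model with no predictive ability.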


Predicted Probabilities 


Since binary data have a simple discrete distribution, an obvious approach is to
compare the sample average predicted probability that $y = 1$, $N^{-1}\sum_i \hat p_i$, where
$\hat p_i = F(x_i'\hat\beta)$, with the sample frequency $\bar y$. However, this is not useful for the logit
model with an intercept, since $N^{-1}\sum_i \hat p_i = \bar y$ always holds as the ML first-order
conditions imply $\sum_i [y_i - \Lambda(x_i'\hat\beta)] = 0$. A similar result holds for estimation by OLS; for
the probit model the result is not exact but in practice is quite close.

This approach can be used for predictions over subsamples, however, and can then
form the basis for the chi-square goodness-of-fit test given in Section 8.2.6.


14.4. Latent Variable Models


A latent variable is a variable that is incompletely observed. Latent variables can be 
introduced into binary outcome models in two different ways. In the first the latent 
variable is an index of an unobserved propensity for the event of interest to occur. 
In the second the latent variable is the difference in utility that occurs if the event 
of interest occurs, which presumes that the binary outcome is a result of individual 
choice. The latter method makes clear the need to distinguish between regressors that 
vary across alternatives for a given individual and regressors such as socioeconomic 
characteristics that for a given individual are invariant across alternatives. 

It should be stressed that the binary outcome is Bernoulli distributed, as in Sec- 
tion 14.3. Latent variable models merely provide a rationale for a particular functional 
form for the Bernoulli parameter. 

Latent variable models do provide extensions to multinomial outcomes and cen- 
sored outcomes (detailed in Chapters 15 and 16). They also provide a framework that 
permits Bayesian analysis using data augmentation (see Section 13.7). Brief discus- 
sion of Bayesian analysis of binary and multinomial data is given in Sections 15.7.2 
and 15.8.2. 


14.4.1. Index Function Models 


In the index function formulation interest lies in explaining an underlying unobserved 
continuous random variable y*, but all we observe is the binary variable y, which takes 
value 1 or 0 according to whether or not y* crosses a threshold. Different distributions 
for y* lead to different binary outcome models. 

Let y* be a latent (or unobserved) variable, such as the desire to work if labor supply 
is being modeled. The natural regression model for y* is the index function model 


$y^* = x'\beta + u$. (14.15)


However, this model cannot be estimated as $y^*$ is not observed. Instead, we observe


$y = \begin{cases} 1 & \text{if } y^* > 0, \\ 0 & \text{if } y^* \leq 0, \end{cases}$ (14.16)


where the threshold of zero is a normalization explained in the following. 
Given (14.16),


$\Pr[y = 1|x] = \Pr[y^* > 0]$ (14.17)
$= \Pr[x'\beta + u > 0]$
$= \Pr[-u < x'\beta]$
$= F(x'\beta)$,


where $F$ is the cdf of $-u$, which equals the cdf of $u$ in the usual case of density
symmetric about 0.


The index function model therefore provides motivation for the functional form of
$F(\cdot)$ in (14.1).


Probit and Logit Models


The probit model arises if the error $u$ is standard normal distributed, since then (14.17)
yields $\Pr[-u < x'\beta] = \Phi(x'\beta)$, where $\Phi(\cdot)$ is the cdf of the standard normal.

Now introduce the logistic distribution. In its standard form the logistic has cdf


$\Lambda(u) = e^u/(1 + e^u)$, $-\infty < u < \infty$. (14.18)


The density function $\Lambda'(u) = e^u/(1 + e^u)^2$ is symmetric about 0, and a logistic random
variable has mean 0 and variance $\pi^2/3 \simeq 1.814^2$.

The logit model arises if the error $u$ is logistic distributed, since then (14.17) yields
$\Pr[-u < x'\beta] = \Lambda(x'\beta)$. Note that $\beta$ is scaled differently in the two models due to
different $V[u]$.


Identification Considerations 


Identification of the single-index model requires a restriction on the variance of $u$, as
the single-index model can only identify $\beta$ up to scale. All that is observed is whether
or not $y^* > 0$, or equivalently whether or not $x'\beta + u > 0$. However, this is equivalent
to whether or not $x'\beta^+ + u^+ > 0$, where $\beta^+ = a\beta$ and $u^+ = au$ for any $a > 0$. Placing
a restriction on the variance of the error ($u$ or $u^+$) secures uniqueness of $\beta$. The
error variance is set to one in the probit model and $\pi^2/3$ in the logit model.

The threshold for the index model need not be zero. If more generally $y = 1$ when
$y^* > z'\delta$ then (14.17) becomes $\Pr[y = 1] = F(x'\beta - z'\delta)$. Then $\delta$ can be separately
identified if and only if all components of $z$ and $x$ differ. In particular, if both $x$ and
$z$ include intercepts these cannot be separately identified, so we normalize the threshold
intercept to be zero. Note also that the mean of the error distribution needs to be
normalized. For the logit and probit models it is set to zero.


Discussion 


The index function model implies a direct interpretation of $\beta$ as the change in the
latent variable $y^*$ when $x$ changes by one unit. Even though $y^*$ is unobserved, this
interpretation is meaningful if one uses knowledge of the specified variance of $u$. For
example, a slope parameter of 0.5 in the probit model means a one-unit change in
the regressor leads to a 0.5 standard deviation change in $y^*$, since in this model the
variance of the error $u$, and hence of $y^*$ conditional on $x$, equals 1.

Commonly used extensions of the index function approach are to ordered discrete 
choice models (see Section 15.9) and to models for censored and selected samples (see 
Chapter 16). 


14.4.2. Random Utility Models 


In the random utility formulation a consumer chooses between alternatives 0 and 1
according to which has the higher satisfaction or utility. The discrete variable $y$ then
takes value 1 if alternative 1 has higher utility, and it takes value 0 if alternative 0 has
higher utility.

The additive random utility model (ARUM) specifies the utilities of alternatives
0 and 1 to be


$U_0 = V_0 + \varepsilon_0$, (14.19)
$U_1 = V_1 + \varepsilon_1$,


where $V_0$ and $V_1$ are deterministic components of utility and $\varepsilon_0$ and $\varepsilon_1$ are random
components of utility. A simple example is $V_0 = x'\beta_0$ and $V_1 = x'\beta_1$, though from
Section 14.4.3 only $(\beta_1 - \beta_0)$ is then identified.

The alternative with higher utility is chosen. We observe $y = 1$, say, if $U_1 > U_0$.
Owing to the presence of the random components of utility this is a random event
with


$\Pr[y = 1] = \Pr[U_1 > U_0]$ (14.20)
$= \Pr[V_1 + \varepsilon_1 > V_0 + \varepsilon_0]$
$= \Pr[\varepsilon_0 - \varepsilon_1 < V_1 - V_0]$
$= F(V_1 - V_0)$,


where $F$ is the cdf of $(\varepsilon_0 - \varepsilon_1)$. This yields $\Pr[y = 1] = F(x'\beta)$ if $V_1 - V_0 = x'\beta$.

The ARUM requires a scale normalization since if $U_1 > U_0$ then $aU_1 > aU_0$ for any $a > 0$. This
is usually done by specifying the variance of $\varepsilon_0 - \varepsilon_1$ or the variances of $\varepsilon_0$ and $\varepsilon_1$.

Different specifications for the distributions of $\varepsilon_0$ and $\varepsilon_1$ give different $F(\cdot)$ and
hence different discrete choice models. The random utility formulation is especially
useful for specifying unordered multinomial choice models (see Section 15.5).


Probit and Logit Models 


An obvious choice for error distribution in (14.19) is that $\varepsilon_0$ and $\varepsilon_1$ are normal. Then
$(\varepsilon_0 - \varepsilon_1)$ is normally distributed. Normalization of the variance of $(\varepsilon_0 - \varepsilon_1)$ to unity
gives the probit model since then $F(\cdot)$ in (14.20) is the standard normal cdf.

Now introduce the type 1 extreme value distribution or log Weibull distribution.
Then the random variable $\varepsilon$ has density


$f(\varepsilon) = e^{-\varepsilon}\exp(-e^{-\varepsilon})$, $-\infty < \varepsilon < \infty$, (14.21)


and cdf $F(\varepsilon) = \exp(-e^{-\varepsilon})$. The extreme value distributions, rarely used in econometrics,
are obtained as limiting distributions as $N \to \infty$ of the maximum of $N$ random
variables drawn from the same distribution. The type 1 extreme value distribution is
a special case that is right-skewed over $(-\infty, \infty)$ with most of the mass between $-2$
and 5. It has median $-\ln(-\ln(0.5)) \simeq 0.36651$, mean $-\Gamma'(1) \simeq 0.57722$, where $\Gamma'(x)$
denotes the derivative of the gamma function, and variance $\pi^2/6 \simeq 1.28255^2$. The
distribution is well approximated by a log-normal.

The logit model arises if $\varepsilon_0$ and $\varepsilon_1$ are assumed to be independent type 1 extreme
value distributed. Then the difference $(\varepsilon_0 - \varepsilon_1)$ can be shown to be logistic distributed
(see Johnson and Kotz, 1970), so $F(\cdot)$ in (14.20) is the logistic cdf.
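
This result can be checked by simulation. The sketch below draws two independent type 1 extreme value errors using NumPy's `Generator.gumbel`, which samples from the density in (14.21) when `loc=0` and `scale=1`, and compares the empirical cdf of their difference with the logistic cdf. The seed, sample size, and evaluation grid are arbitrary choices made here.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Independent type 1 extreme value draws with density exp(-e) * exp(-exp(-e)).
eps0 = rng.gumbel(loc=0.0, scale=1.0, size=n)
eps1 = rng.gumbel(loc=0.0, scale=1.0, size=n)
diff = eps0 - eps1

def logistic_cdf(z):
    """Logistic cdf Lambda(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

grid = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
empirical_cdf = np.array([(diff <= g).mean() for g in grid])
max_gap = np.max(np.abs(empirical_cdf - logistic_cdf(grid)))

mean_eps = eps0.mean()     # should be near 0.57722
var_eps = eps0.var()       # should be near pi^2 / 6
var_diff = diff.var()      # should be near pi^2 / 3, the logistic variance
```

The gap between the empirical and logistic cdfs is of the order of simulation error, and the sample moments are close to the values quoted above for the extreme value and logistic distributions.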

An alternative derivation of this result, working directly with the extreme value
distribution, is given later in Section 14.8. The derivation indicates the difficulty in
obtaining closed-form solutions for probabilities when the ARUM is extended to
choice among three or more alternatives in Section 15.5. Recent computational advances
permit estimation even in the absence of a closed-form solution.


14.4.3. Alternative-Varying Regressors


In most applications of binary choice models, some regressors vary across individuals, 
but regressors do not necessarily vary across alternatives. 

At the one extreme regressors do not vary across alternatives. For example, in labor 
supply models of the decision to work, socioeconomic characteristics such as income 
and gender do not vary across alternatives. A potential regressor, the wage rate, does 
vary across the alternatives of work or not work, but this regressor is usually not in- 
cluded as it is only observed for those who choose to work. 

At the other extreme all regressors may vary across alternatives. For example, in
transportation mode choice models the regressors may be the time cost and money
cost of the two modes of transportation.

A general hybrid ARUM defines the deterministic components of utility in (14.19)
to be


$V_{ij} = z_{ij}'\alpha_j + w_i'\gamma_j$, $j = 0, 1$, (14.22)


where $z_{ij}$ are regressors that take different values across the two alternatives, whereas
$w_i$ are individual characteristics that do not vary with the choice. Then (14.20) yields


$\Pr[y_i = 1] = F(z_{i1}'\alpha_1 - z_{i0}'\alpha_0 + w_i'(\gamma_1 - \gamma_0))$.


For alternative-invariant regressors only the parameter difference $(\gamma_1 - \gamma_0)$ can be
identified. For regressors that vary across both alternatives and individuals
the coefficients can vary over alternatives, but it is customary to set
$\alpha_1 = \alpha_0 = \alpha$. For example, the loss of utility resulting from a one-dollar increase in
travel costs is expected to be the same across different transportation modes. Thus the
ARUM leads to


$\Pr[y_i = 1] = F((z_{i1} - z_{i0})'\alpha + w_i'(\gamma_1 - \gamma_0))$, (14.23)


which is the original binary choice model (14.1) where the regressors are alternative-invariant
regressors $w$ and the difference across alternatives of alternative-varying
regressors $z$.


14.5. Choice-Based Samples 


Choice-based sampling arises whenever selection of the sample is determined in part
by values taken by the dependent variable $y$, rather than being completely random or
being based in part on values taken by $x$.

Discrete data models are a leading example since surveys often deliberately oversample
choices that are made infrequently. For example, if few people choose to
commute by bus, an oversampling of bus riders may be undertaken. In the medical
literature the same problem arises with case-control analysis where, for example, a
binary data analysis may be based on a full sample of those who had a heart attack and
a subsample of people with similar characteristics who did not have a heart attack. The
standard term choice-based sampling is a little misleading since it does not arise from
individual choice.

To see the inconsistency of standard binary choice methods, consider estimation
of the logit model when the only regressor is the intercept. Then $\Lambda(x_i'\beta) = \Lambda(\beta)$
and the logit MLE first-order conditions become $N^{-1}\sum_i (y_i - \Lambda(\beta)) = 0$, so
$\hat\beta = \ln(\bar y/(1 - \bar y))$. Consistency of $\hat\beta$ clearly requires a random sample because, for
example, oversampling $y = 1$ leads to overestimation of $\bar y$ and hence $\beta$.

Methods to obtain consistent estimates given endogenous sampling such as choice-based
sampling are covered in detail in Section 24.4. Analysis is straightforward if
the degree of oversampling is known. Let $Q_1$ denote the fraction of the population
with $y = 1$ and $H_1 = \bar y$ denote the fraction of the sample with $y = 1$. Similarly define
$Q_0 = 1 - Q_1$ and $H_0 = 1 - H_1$. Then consistent estimation is possible using the
weighted MLE proposed by Manski and Lerman (1977). For binary outcome models
this maximizes the weighted log-likelihood


$\mathcal{L}_N^W(\beta) = \sum_{i=1}^{N} \left\{ \frac{Q_1}{H_1} y_i \ln F(x_i'\beta) + \frac{Q_0}{H_0}(1 - y_i)\ln(1 - F(x_i'\beta)) \right\}$.


For example, if outcomes $y = 1$ are oversampled, then $Q_1/H_1 < 1$ and the oversampled
observations with $y = 1$ are downweighted. This estimator is easily implemented
using any program for binary outcome models that permits weighting of observations.
Then observations with $y = 1$ are given weight $Q_1/H_1$ and observations with $y = 0$
are given weight $Q_0/H_0$.
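
The effect of the weights is easiest to see in the intercept-only logit model, where the weighted first-order condition $\sum_i w_i(y_i - \Lambda(\beta)) = 0$ has a closed-form solution. The sketch below assumes a hypothetical population fraction and a hypothetical sample in which half the observations have $y = 1$; the function name `logit_intercept_mle` is introduced here.

```python
import math

def logit_intercept_mle(n1, n0, w1=1.0, w0=1.0):
    """Intercept-only logit with weight w1 on the n1 observations with y = 1 and
    weight w0 on the n0 observations with y = 0. The weighted first-order condition
    w1*n1*(1 - Lambda(b)) = w0*n0*Lambda(b) gives b = ln(w1*n1 / (w0*n0))."""
    return math.log((w1 * n1) / (w0 * n0))

Q1 = 0.10                      # hypothetical population fraction with y = 1
Q0 = 1 - Q1
n1, n0 = 500, 500              # hypothetical choice-based sample that oversamples y = 1
H1 = n1 / (n1 + n0)            # sample fraction with y = 1
H0 = 1 - H1

b_unweighted = logit_intercept_mle(n1, n0)                     # ln(H1 / H0)
b_weighted = logit_intercept_mle(n1, n0, Q1 / H1, Q0 / H0)     # Manski-Lerman weights
b_population = math.log(Q1 / Q0)                               # value implied by the population
```

The unweighted estimate reproduces the sample odds, whereas the weighted estimate equals the population log-odds exactly in this special case. With regressors there is no closed form, but the same weights would be passed to a weighted logit or probit routine.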

A detailed summary of ML methods for choice-based sampling of binary and
multinomial data, including methods when $Q_1$ and $Q_0$ are unknown, is given in
Amemiya (1985, Section 9.5). The weighted MLE is inefficient but simple to implement
and the efficiency loss may not be great. Manski and McFadden (1981a) proposed
a variation that is more efficient (see Amemiya and Vuong, 1987). Cosslett
(1981a,b) proposed further refinements that are fully efficient but impractical to
implement. Imbens (1992) and Lancaster and Imbens (1996) proposed GMM estimation
as an alternative method that is feasible to implement and is fully efficient. King and
Zeng (2001) give a summary for the binary logit model; additionally, they consider
small-sample corrections that, even with oversampling, make a difference when the
outcome of interest occurs with low probability in the population. For further details see
Section 24.4.

The epidemiological literature has focused on the logit model for case-control
studies. The method is attributed to Prentice and Pyke (1979). See Breslow (1996),
especially his Section 4.3, which discusses links between the econometrics and
epidemiological literature.


14.6. Grouped and Aggregate Data


In some applications only grouped or aggregate data may be available, yet individual 
behavior is felt to be best modeled by a binary choice model. Grouping poses no prob- 
lem when the grouping is based on unique values of the regressors and there are many 
observations per unique value of the regressors. We begin with this simple example 
before moving to more realistic ones. 


14.6.1. Berkson’s Minimum Chi-Square Estimator 


Suppose the regressor vector $x_i$, $i = 1, \ldots, N$, takes only $T$ distinct values, where
$T$ is much smaller than $N$. Then for each value of the regressors we have multiple
observations on $y$. This type of grouped data is called many observations per cell. It
can arise particularly in experimental data where $x$ is of low dimension and is set by
experimental design to just a few values. Let $x_t$, $t = 1, \ldots, T$, be the $T$ distinct values,
$N_t$ be the number of observations on $y$ for the $t$th distinct value of $x$, so $\sum_{t=1}^{T} N_t = N$,
and $\bar p_t$ be the proportion of times $y_i = 1$ when $x_i = x_t$. Note that the subscript $t$ is
being used to denote grouping and does not necessarily denote time.

For individual $i$ with $x_i = x_t$, the Bernoulli probability is


$p_t = \Pr[y_i = 1|x_i = x_t] = F(x_t'\beta)$, (14.24)


as before. Inverting (14.24) implies that


$F^{-1}(p_t) = x_t'\beta$.


Now $p_t$ is unknown but can be estimated by $\bar p_t$, so Berkson proposed regressing
$F^{-1}(\bar p_t)$ on $x_t$. Thus we estimate by LS the transformation model


$F^{-1}(\bar p_t) = x_t'\beta + v_t$, $t = 1, \ldots, T$. (14.25)


The error term $v_t = F^{-1}(\bar p_t) - F^{-1}(p_t)$ is heteroskedastic with variance that will
decrease as $N_t$ increases, since then $\bar p_t$ is a better estimate of $p_t$, and will also depend
on the shape of $F(\cdot)$. By Taylor series expansion (see Amemiya (1981, p. 1498) or
Maddala (1983, p. 31)), $v_t$ has variance that can be consistently estimated by


$\hat\sigma_t^2 = \frac{\bar p_t(1 - \bar p_t)}{N_t [F'(F^{-1}(\bar p_t))]^2}$. (14.26)


Berkson’s minimum chi-square estimator $\hat\beta_{MC}$ minimizes the weighted sum of
squared residuals $\sum_{t=1}^{T} (F^{-1}(\bar p_t) - x_t'\beta)^2/\hat\sigma_t^2$ with respect to $\beta$. This is easily computed by OLS
regression of $F^{-1}(\bar p_t)/\hat\sigma_t$ on $x_t/\hat\sigma_t$.

This estimator is simple to implement, as it only requires an OLS package. Yet it
is fully efficient, as it can be shown to have the same asymptotic distribution as the
MLE that treats each observation separately, rather than grouping them into cells with
common regressor value $x_t$. For the logit model this estimator is especially simple, as
$F^{-1}(\bar p_t) = \ln(\bar p_t/(1 - \bar p_t))$ and $\hat\sigma_t^2 = 1/[N_t \bar p_t(1 - \bar p_t)]$.
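
For the logit case the estimator reduces to a weighted least-squares regression of the log-odds on the cell regressors. The sketch below assumes a hypothetical design with five cells, hypothetical parameter values, and simulated cell proportions; the function name `berkson_logit` is introduced here. It requires every cell proportion to lie strictly between 0 and 1.

```python
import numpy as np

def berkson_logit(x_cells, n_cells, p_bar):
    """Minimum chi-square logit estimator: OLS of ln(p/(1-p))/sigma on x/sigma,
    with sigma_t^2 = 1 / [N_t p_t (1 - p_t)]."""
    z = np.log(p_bar / (1 - p_bar))                       # F^{-1}(p_bar) for the logit
    sigma = 1.0 / np.sqrt(n_cells * p_bar * (1 - p_bar))  # estimated std. dev. of v_t
    Xw = x_cells / sigma[:, None]
    zw = z / sigma
    beta, *_ = np.linalg.lstsq(Xw, zw, rcond=None)
    return beta

# Hypothetical experimental design: T = 5 cells, 400 observations per cell.
x_vals = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
x_cells = np.column_stack([np.ones(5), x_vals])           # intercept plus one regressor
n_cells = np.full(5, 400)
beta_true = np.array([0.3, 0.8])                          # hypothetical dgp values

p_true = 1.0 / (1.0 + np.exp(-(x_cells @ beta_true)))
rng = np.random.default_rng(2)
p_bar = rng.binomial(n_cells, p_true) / n_cells           # simulated cell proportions

beta_mc = berkson_logit(x_cells, n_cells, p_bar)          # estimate from sampled proportions
beta_exact = berkson_logit(x_cells, n_cells, p_true)      # true proportions recover beta exactly
```

With the true cell probabilities the log-odds are exactly linear in $x_t$, so the regression returns the dgp parameters. With sampled proportions the estimate differs only by sampling error, which shrinks as the cell sizes $N_t$ grow.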

The advantage of the minimum chi-square estimator is its computational simplicity,
although advances in computer power now make this point moot. Grouped economics
data are rarely such that there are many observations within group per unique value
of the regressors, unless the regressors are just a few indicator variables. The method
does provide insights into aggregation, however, a topic we now consider.


14.6.2. Estimation with Aggregate Data 


Econometrics examples of data aggregation include data on the proportion of people 
working and data on the proportion of those commuting by bus in different regions, 
explained by data on the average characteristics of people in the region. 

As a concrete example, suppose $\bar p_t$ equals the unemployment rate in region $t$ and
$\bar x_t$ equals the average level of schooling in region $t$. One possible model is LS regression
of $\bar p_t$ on $\bar x_t$. Because $0 < \bar p_t < 1$, many studies instead transform to a dependent
variable that is unbounded, estimating the model


$\ln\left(\frac{\bar p_t}{1 - \bar p_t}\right) = \bar x_t'\beta + u_t$, (14.27)


where $u_t$ is an error.

This model looks similar to the minimum chi-square estimator for the logit model,
when $F^{-1}(\bar p_t) = \ln(\bar p_t/(1 - \bar p_t))$. However, it is not because Berkson’s estimator is
only appropriate if all regressors in the $t$th cell take the same value. Here instead the
regressors can take different values, as different people in region $t$ will have different
levels of schooling.

To see the consequences of aggregation when there is within-cell heterogeneity
in the regressors, suppose the individual-level model is an index model (see Section
14.4.1) with


$y_i^* = x_i'\beta + u_i$,
$u_i \sim N[0, 1]$.


We choose to work with normal errors, corresponding to a probit rather than logit
model, because it is then possible to obtain analytical results. Model the heterogeneity
as


$x_i \sim N[\mu_t, \Sigma_t]$,


for individuals in cell $t$. This realistically permits variation across cells, and the complication
is that $\Sigma_t \neq 0$, so there is within-cell heterogeneity. Then in region $t$, conditional
on $\beta$, $\mu_t$, and $\Sigma_t$,


$\Pr[y_i = 1] = \Pr[x_i'\beta + u_i > 0]$
$= \Pr\left[\frac{x_i'\beta + u_i - \mu_t'\beta}{\sqrt{1 + \beta'\Sigma_t\beta}} > \frac{-\mu_t'\beta}{\sqrt{1 + \beta'\Sigma_t\beta}}\right]$
$= \Phi\left(\frac{\mu_t'\beta}{\sqrt{1 + \beta'\Sigma_t\beta}}\right)$,

where we use x, 3 + u; ~ N[/ 3, (1 + B'E 3)] given the preceding assumptions and 
then subtract the mean and divide by the standard deviation to transform to a standard 
normal variate. 
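By similar argument to that leading to (14.25) given (14.24), the underlying
individual-level binary choice parameters $\boldsymbol{\beta}$ can be consistently estimated by
nonlinear LS estimation of $\boldsymbol{\beta}$ in the regression

$$\bar{y}_t = \Phi\left(\frac{\bar{\mathbf{x}}_t'\boldsymbol{\beta}}{\sqrt{1+\boldsymbol{\beta}'\mathbf{S}_t\boldsymbol{\beta}}}\right) + v_t, \tag{14.28}$$

where $\bar{y}_t$ and $\bar{\mathbf{x}}_t$ are cell averages and $\mathbf{S}_t$ is the sample variance of
$\mathbf{x}_i$ in cell $t$. The Berkson minimum chi-square estimate instead regresses
$\Phi^{-1}(\bar{y}_t)$ on $\bar{\mathbf{x}}_t$ and is inconsistent for $\boldsymbol{\beta}$ unless
$\boldsymbol{\Sigma}_t = \mathbf{0}$.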
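The attenuation caused by within-cell heterogeneity can be checked numerically. The following is a minimal sketch for a single cell with a scalar regressor; the values of beta, mu, and sigma are illustrative assumptions, not taken from the text. It compares the simulated cell proportion with $\Phi(\mu\beta/\sqrt{1+\beta^2\sigma^2})$ and shows the slope that a Berkson-type inversion of the cell proportion would recover.

```python
import math
import random
from statistics import NormalDist

def simulate_cell(beta, mu, sigma, n, seed=0):
    """Simulate y = 1(x*beta + u > 0) with x ~ N[mu, sigma^2] and u ~ N[0, 1].

    Returns the cell averages (y_bar, x_bar).
    """
    rng = random.Random(seed)
    y_sum, x_sum = 0, 0.0
    for _ in range(n):
        x = rng.gauss(mu, sigma)
        u = rng.gauss(0.0, 1.0)
        x_sum += x
        y_sum += 1 if x * beta + u > 0 else 0
    return y_sum / n, x_sum / n

std_normal = NormalDist()
beta, mu, sigma = 1.0, 0.5, 1.5  # hypothetical values for one cell

p_bar, x_bar = simulate_cell(beta, mu, sigma, n=200_000)

# Cell probability implied by the derivation, scalar case.
p_formula = std_normal.cdf(mu * beta / math.sqrt(1.0 + beta**2 * sigma**2))

# Inverting Phi and dividing by x_bar ignores the within-cell variance,
# so it recovers an attenuated slope rather than beta itself.
beta_naive = std_normal.inv_cdf(p_bar) / x_bar
beta_attenuated = beta / math.sqrt(1.0 + beta**2 * sigma**2)

print(round(p_bar, 3), round(p_formula, 3))
print(round(beta_naive, 3), round(beta_attenuated, 3))
```

With several cells one would instead fit (14.28) by nonlinear least squares; the single-cell check above only illustrates why the unadjusted inversion is inconsistent when the within-cell variance is nonzero.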


14.6.3. Discussion 


Aggregation issues are much more complicated in nonlinear models. If the original
individual-level model was the linear model $y_i = \mathbf{x}_i'\boldsymbol{\beta} + u_i$ with
$\mathbf{x}_i \sim \mathcal{N}[\boldsymbol{\mu}_t, \boldsymbol{\Sigma}_t]$ in the $t$th cell, then the
corresponding linear regression of $\bar{y}_t$ on $\bar{\mathbf{x}}_t$ would yield a consistent
estimate of $\boldsymbol{\beta}$. With nonlinear models similar aggregation leads to inconsistent
estimation of individual-level parameters, unless adjustment such as that in (14.28) is
undertaken. Furthermore, the example in Section 14.6.2, due to McFadden and Reid
(1975), is unusual in that aggregation of a nonlinear model leads to tractable results.
This example is discussed in considerable detail by Cameron (1990), who considers it
in the wider context of aggregation in nonlinear models.

An active area of aggregation in discrete choice, usually multinomial choice, is the 
marketing literature on market shares of branded goods. Allenby and Ross (1991) 
present examples where the bias of fitting aggregate logit models may not be great. 
More importantly, recent computational advances permit estimation of individual-level 
parameters with aggregate data even if aggregation yields no closed-form solution. 
See, for example, Berry (1994) and Nevo (2001), who estimate models qualitatively 
similar to the random parameters logit model in Section 15.7. 

Finally, note that in many applications with aggregate proportions data, such as the
unemployment rate by region, there is no desire to estimate individual-level parameters.
The only goal is a reasonable model for a dependent variable $\bar{p}_t$ that lies between zero
and one. Then the linear regression (14.27) may be fine. The error $u_t$ in (14.27) will no
longer have the variance given in (14.26). It will still be heteroskedastic, however, so
statistical inference should be based on White heteroskedasticity-robust standard errors.


14.7. Semiparametric Estimation 


The binary outcome model is perhaps the leading example of semiparametric regression.
Most econometrics studies presume a single-index form $F(\mathbf{x}_i'\boldsymbol{\beta})$, where the
functional form for $F$ is not specified. The goal is to obtain an estimate of
$\boldsymbol{\beta}$ that is consistent, ideally $\sqrt{N}$-consistent and asymptotically normal,
while $F(\cdot)$ is viewed as a nuisance function. The single-index model semiparametric
estimators of Section 9.7.4 can be applied. Additional estimators exploit the index function
model interpretation for binary outcomes. In addition, semiparametric ML estimation that
attains the semiparametric efficiency bounds is possible with little need for additional
assumptions, since it is clear that the distribution is Bernoulli and only
$F(\mathbf{x}_i'\boldsymbol{\beta})$ is not known.


14.7.1. Semiparametric Conditional Mean Estimation 


The estimation problem in general is one where the dependent variable $y_i$ takes value
0 or 1 with conditional mean

$$\mathrm{E}[y_i|\mathbf{x}_i] = m(\mathbf{x}_i),$$

where $m(\cdot)$ is unknown. Note that $m(\mathbf{x}_i)$ also equals the conditional probability
that $y_i = 1$.

The nonparametric regression methods of Sections 9.4—9.6 can be applied, despite 
the binary nature of the dependent variable. This is easily seen from Figure 14.1, a 
scatterplot of binary variable y on scalar regressor x, a natural candidate for kernel 
regression of y on x. The fitted values will lie between 0 and 1, aside from unusual 
cases such as when higher order kernels are used, in which case the fitted variable can 
take negative values. 
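As an illustration of kernel regression with a binary dependent variable, the sketch below computes a Nadaraya-Watson estimate with a Gaussian kernel on simulated data. The data-generating process (a logit with index 1.5x), the bandwidth, and the evaluation grid are all assumptions made for the example. Because the Gaussian kernel weights are nonnegative, each fitted value is a weighted average of zeros and ones and so lies between 0 and 1.

```python
import math
import random

def kernel_regression(x0, xs, ys, h):
    """Nadaraya-Watson estimate of E[y | x = x0] using a Gaussian kernel."""
    weights = [math.exp(-0.5 * ((x - x0) / h) ** 2) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

rng = random.Random(1)
xs = [rng.uniform(-3.0, 3.0) for _ in range(500)]
# Hypothetical data-generating process: Pr[y = 1 | x] = logistic(1.5 * x).
ys = [1 if rng.random() < 1.0 / (1.0 + math.exp(-1.5 * x)) else 0 for x in xs]

grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
fitted = [kernel_regression(x0, xs, ys, h=0.5) for x0 in grid]

for x0, m_hat in zip(grid, fitted):
    print(f"x = {x0:+.1f}  estimated Pr[y = 1 | x] = {m_hat:.3f}")
```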

In many microeconometrics applications x is of too high a dimension for nonpara- 
metric methods to work well (the curse of dimensionality). Semiparametric regression 
models that partially specify m(-) are given in Section 9.7. Additive models are fairly 
popular in statistical applications. In econometrics single-index models are instead 
used, since a popular starting point is the index function model of Section 14.4.1. This 
yields a single-index model if the latent variable $y^* = \mathbf{x}'\boldsymbol{\beta} + u$. Thus we suppose


$$\mathrm{E}[y_i|\mathbf{x}_i] = F(\mathbf{x}_i'\boldsymbol{\beta}),$$


where we follow the notation of this chapter and use F(-) rather than g(-) to denote the 
unknown function. 

From Section 9.7.4, $\boldsymbol{\beta}$ is only identified up to location and scale. This is also
clear from Section 14.4.1, where the error $u$ in the index model was normalized to have
mean 0 (location) and the variance needed to be specified (scale). Here restrictions are
not placed on $u$, so $\boldsymbol{\beta}$ is not completely identified but the ratios of slope
coefficients are identified. See Manski (1988b) for a detailed analysis of identification in
binary choice models.

$\sqrt{N}$-consistent asymptotically normal estimates of $\boldsymbol{\beta}$ can be obtained by
average derivative estimation or by semiparametric least squares (see Section 9.7.4). However,
alternative estimators, specific to binary outcomes, are more often used.


14.7.2. Maximum Score Estimation 


Semiparametric estimators for binary outcomes are often based on the index function
model $y^* = \mathbf{x}'\boldsymbol{\beta} + u$. In such cases it is convenient to write the
model as

$$y_i = \mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} + u_i > 0),$$

where $\mathbf{1}(A) = 1$ if event $A$ occurs and equals zero otherwise.
Manski (1975) noted that the predicted value of $y_i$ is $\mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} > 0)$,
setting $u_i = 0$ since $u_i$ is unknown, in which case a score of the number of correct
predictions is

$$S_N(\boldsymbol{\beta}) = \sum_{i=1}^{N}\left[\, y_i\,\mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} > 0) + (1 - y_i)\,\mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} \le 0)\right], \tag{14.29}$$

since correct predictions occur if $y_i = 1$ and $\mathbf{x}_i'\boldsymbol{\beta} > 0$, or if $y_i = 0$
and $\mathbf{x}_i'\boldsymbol{\beta} \le 0$. Manski's maximum score estimator
$\hat{\boldsymbol{\beta}}_{MS}$ maximizes $S_N(\boldsymbol{\beta})$. This is a nonstandard
problem because $\mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} > 0)$ is not differentiable in
$\boldsymbol{\beta}$. Manski (1975, 1985) established consistency assuming $F(0) = 0.5$, or
equivalently that $\mathrm{Median}[u_i|\mathbf{x}_i] = 0$. It has subsequently been shown that
$N^{1/3}(\hat{\boldsymbol{\beta}}_{MS} - \boldsymbol{\beta})$ has a nonnormal limit distribution,
though inference can be performed using the bootstrap (Manski and Thompson (1986)).

Manski's estimator can be viewed as a least absolute deviations estimator. From
Section 4.6.2, the LAD estimator minimizes the sum of absolute differences between
$y_i$ and $\mathrm{Median}[y_i|\mathbf{x}_i]$. This less familiar estimator is qualitatively
similar to the LS estimator, which minimizes the sum of squared differences between $y_i$ and
$\mathrm{E}[y_i|\mathbf{x}_i]$. To implement LAD here requires obtaining
$\mathrm{Median}[y_i|\mathbf{x}_i]$. If $\mathrm{Median}[u_i|\mathbf{x}_i] = 0$ then
$\mathrm{Median}[y_i^*|\mathbf{x}_i] = \mathbf{x}_i'\boldsymbol{\beta}$, so
$\mathrm{Median}[y_i|\mathbf{x}_i] = \mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} > 0)$. Thus the
binary outcome model LAD estimator minimizes

$$Q_N(\boldsymbol{\beta}) = \sum_{i=1}^{N}\left|\, y_i - \mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} > 0)\right|. \tag{14.30}$$

From Exercise 14.4, $Q_N(\boldsymbol{\beta}) = N - S_N(\boldsymbol{\beta})$, so the maximum score
estimator equals the LAD estimator. See Manski (1985, p. 320) for other interpretations of the
maximum score estimator as a LAD estimator.
The objective function $S_N(\boldsymbol{\beta})$ for the maximum score estimator given in (14.29) is
not differentiable. It can be rewritten as

$$S_N(\boldsymbol{\beta}) = \sum_{i=1}^{N}(2y_i - 1)\,\mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} > 0) + N - \sum_{i=1}^{N} y_i;$$

see Exercise 14.4. The second sum can be ignored as it does not involve $\boldsymbol{\beta}$.
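The three objective functions can be compared directly in code. The sketch below uses simulated data with a hypothetical data-generating process, normalizes the intercept coefficient to one (an assumed scale normalization, since only ratios of coefficients are identified), and searches a grid of slope values. It is meant only to illustrate the score, its LAD counterpart, and the rewritten form, not to be an efficient optimizer.

```python
import random

def index_value(beta, x):
    """Linear index x'beta."""
    return sum(b * xk for b, xk in zip(beta, x))

def score(beta, xs, ys):
    """S_N(beta) in (14.29): the number of correct predictions."""
    total = 0
    for x, y in zip(xs, ys):
        positive = index_value(beta, x) > 0
        total += y * int(positive) + (1 - y) * int(not positive)
    return total

def lad(beta, xs, ys):
    """Q_N(beta) in (14.30): sum of |y_i - 1(x_i'beta > 0)|."""
    return sum(abs(y - int(index_value(beta, x) > 0)) for x, y in zip(xs, ys))

def score_rewritten(beta, xs, ys):
    """Rewritten score: sum of (2y_i - 1) 1(x_i'beta > 0), plus N - sum of y_i."""
    first = sum((2 * y - 1) * int(index_value(beta, x) > 0) for x, y in zip(xs, ys))
    return first + len(ys) - sum(ys)

# Hypothetical data: y = 1(1 - 2*x + u > 0) with u ~ N[0, 1].
rng = random.Random(42)
n_obs = 500
xs = [(1.0, rng.uniform(-2.0, 2.0)) for _ in range(n_obs)]
ys = [int(1.0 - 2.0 * x[1] + rng.gauss(0.0, 1.0) > 0) for x in xs]

# Grid search over the slope with the intercept coefficient fixed at one.
candidates = [(1.0, -4.0 + 0.05 * k) for k in range(81)]
best = max(candidates, key=lambda b: score(b, xs, ys))
print("slope maximizing the score on the grid:", round(best[1], 2))
print("score, LAD:", score(best, xs, ys), lad(best, xs, ys))
```

Because the score is a step function of the parameters, many grid points typically tie; `max` returns the first of them.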
An estimator with differentiable objective function is the smooth maximum score
estimator of Horowitz (1992) that maximizes

$$Q_N^{s}(\boldsymbol{\beta}) = \sum_{i=1}^{N}(2y_i - 1)\,K(\mathbf{x}_i'\boldsymbol{\beta}/h_N),$$

where $K(\mathbf{x}'\boldsymbol{\beta}/h_N)$ is a smoothed version of
$\mathbf{1}(\mathbf{x}'\boldsymbol{\beta} > 0)$. Since $\mathbf{1}(\mathbf{x}'\boldsymbol{\beta} > 0)$
equals zero for negative values of $\mathbf{x}'\boldsymbol{\beta}$ and equals one for positive
values of $\mathbf{x}'\boldsymbol{\beta}$, it is natural to choose $K(\cdot)$ to be a cdf with
$K(0) = 0.5$ and choose $h_N$ to be small. Smoothing simplifies computation of the estimator,
but analysis is complicated by the need to have $h_N \to 0$ at an appropriate rate as
$N \to \infty$. The estimator converges at a rate close to $\sqrt{N}$ and is asymptotically
normal. For details see Horowitz (2002), who presents a bootstrap that permits tests with
better size properties in finite samples.
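The effect of smoothing can be seen on a small example. The sketch below uses the logistic cdf as the smoother, which is one convenient choice of a cdf with K(0) = 0.5 and is an assumption of this example rather than the kernel used by Horowitz; the five observations and the parameter vector are made up. As the bandwidth shrinks, the smoothed objective approaches the unsmoothed sum of (2y - 1) times the indicator.

```python
import math

def logistic_cdf(v):
    """Logistic cdf, written to avoid overflow for large |v|; K(0) = 0.5."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def index_value(beta, x):
    return sum(b * xk for b, xk in zip(beta, x))

def smoothed_objective(beta, xs, ys, h):
    """Sum over i of (2y_i - 1) K(x_i'beta / h)."""
    return sum((2 * y - 1) * logistic_cdf(index_value(beta, x) / h)
               for x, y in zip(xs, ys))

def unsmoothed_objective(beta, xs, ys):
    """Sum over i of (2y_i - 1) 1(x_i'beta > 0)."""
    return sum((2 * y - 1) * int(index_value(beta, x) > 0)
               for x, y in zip(xs, ys))

# Made-up data: each x is (1, regressor).
xs = [(1.0, -1.2), (1.0, 0.4), (1.0, 2.0), (1.0, -0.3), (1.0, 1.1)]
ys = [1, 1, 0, 1, 0]
beta = (1.0, -2.0)

for h in (0.5, 0.1, 1e-6):
    print(h, round(smoothed_objective(beta, xs, ys, h), 4))
print("unsmoothed:", unsmoothed_objective(beta, xs, ys))
```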
LAD estimation can be extended to the censored regression model (see Section 16.9.2).


14.7.3. Maximum Rank Correlation Estimator 


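Begin with a single-index model with $\mathrm{E}[y_i|\mathbf{x}_i] = F(\mathbf{x}_i'\boldsymbol{\beta})$.
If $F(\mathbf{x}_i'\boldsymbol{\beta})$ is monotonically increasing in $\mathbf{x}_i'\boldsymbol{\beta}$,
then $\mathrm{E}[y_i|\mathbf{x}_i] > \mathrm{E}[y_j|\mathbf{x}_j]$ if
$\mathbf{x}_i'\boldsymbol{\beta} > \mathbf{x}_j'\boldsymbol{\beta}$. Thus it is likely, though not
guaranteed, that the observed values satisfy $y_i > y_j$ when
$\mathbf{x}_i'\boldsymbol{\beta} > \mathbf{x}_j'\boldsymbol{\beta}$. This suggests choosing
$\boldsymbol{\beta}$ to ensure that with high frequency $y_i > y_j$ when
$\mathbf{x}_i'\boldsymbol{\beta} > \mathbf{x}_j'\boldsymbol{\beta}$.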

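The maximum rank correlation (MRC) estimator of Han (1987) chooses $\boldsymbol{\beta}$ to
maximize

$$Q_{MRC}(\boldsymbol{\beta}) = \sum_{i=1}^{N}\sum_{j=i+1}^{N}\left[\mathbf{1}(y_i > y_j)\,\mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} > \mathbf{x}_j'\boldsymbol{\beta}) + \mathbf{1}(y_i < y_j)\,\mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} < \mathbf{x}_j'\boldsymbol{\beta})\right].$$

The $ij$th term in this sum equals one if $y_i > y_j$ when
$\mathbf{x}_i'\boldsymbol{\beta} > \mathbf{x}_j'\boldsymbol{\beta}$ or if $y_i < y_j$ when
$\mathbf{x}_i'\boldsymbol{\beta} < \mathbf{x}_j'\boldsymbol{\beta}$, and equals zero if instead there
is a sign reversal so that $y_i < y_j$ when
$\mathbf{x}_i'\boldsymbol{\beta} > \mathbf{x}_j'\boldsymbol{\beta}$ or $y_i > y_j$ when
$\mathbf{x}_i'\boldsymbol{\beta} < \mathbf{x}_j'\boldsymbol{\beta}$. The estimator is called the
maximum rank correlation estimator because $Q_{MRC}(\boldsymbol{\beta})$ is a multiple of
Kendall's rank correlation coefficient between $y_i$ and $\mathbf{x}_i'\boldsymbol{\beta}$.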

This estimator is $\sqrt{N}$-consistent and asymptotically normal (see Sherman, 1993).
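The MRC objective is a simple count over pairs of observations, as the following sketch illustrates. It sums over pairs with i < j (summing over all ordered pairs would merely double the count), uses a made-up data set of six observations with two regressors, and omits an intercept because an intercept cancels in the pairwise comparison of indexes. The example also shows that rescaling the parameter vector by a positive constant leaves the objective unchanged, which is the sense in which only ratios of coefficients are identified.

```python
def index_value(beta, x):
    return sum(b * xk for b, xk in zip(beta, x))

def mrc_objective(beta, xs, ys):
    """Count pairs (i, j), i < j, whose ordering of y agrees with the ordering of x'beta."""
    idx = [index_value(beta, x) for x in xs]
    n = len(ys)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            if ys[i] > ys[j] and idx[i] > idx[j]:
                total += 1
            elif ys[i] < ys[j] and idx[i] < idx[j]:
                total += 1
    return total

# Made-up data: two regressors, no intercept.
xs = [(0.2, 1.0), (1.5, -0.3), (-0.7, 0.8), (2.1, 0.4), (-1.2, -1.1), (0.9, 2.0)]
ys = [0, 1, 0, 1, 0, 1]

print(mrc_objective((1.0, 0.5), xs, ys))    # every discordant-in-y pair is correctly ordered
print(mrc_objective((3.0, 1.5), xs, ys))    # same value: positive rescaling does not matter
print(mrc_objective((-1.0, -0.5), xs, ys))  # reversing the sign reverses every ordering
```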


14.7.4. Semiparametric ML Estimation 


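For binary choice data the likelihood function given independent observations is
clearly that given in (14.4). The only complication is that $F(\cdot)$ is unknown. Klein and
Spady (1993) proposed the semiparametric MLE that maximizes

$$\mathcal{L}_N(\boldsymbol{\beta}) = \sum_{i=1}^{N}\left\{ y_i \ln \hat{F}(\mathbf{x}_i'\boldsymbol{\beta}) + (1 - y_i)\ln\left(1 - \hat{F}(\mathbf{x}_i'\boldsymbol{\beta})\right)\right\},$$

where $\hat{F}(\mathbf{x}_i'\boldsymbol{\beta})$ is a nonparametric estimate of
$F(\mathbf{x}_i'\boldsymbol{\beta})$.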

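This estimator is similar in spirit to the WSLS estimator of Ichimura (1993) detailed
in Section 9.7.4, and similar issues in computation arise with iteration between
computation of $\hat{\boldsymbol{\beta}}$ given $\hat{F}$ and computation of $\hat{F}$ given
$\hat{\boldsymbol{\beta}}$. Given the ML first-order conditions (14.5), the semiparametric MLE can
also be computed as the solution to the equations

$$\sum_{i=1}^{N}\frac{\hat{F}'(\mathbf{x}_i'\boldsymbol{\beta})}{\hat{F}(\mathbf{x}_i'\boldsymbol{\beta})\left(1 - \hat{F}(\mathbf{x}_i'\boldsymbol{\beta})\right)}\left(y_i - \hat{F}(\mathbf{x}_i'\boldsymbol{\beta})\right)\mathbf{x}_i = \mathbf{0},$$

which are the same as those for the WSLS estimator with weights
$w_i = \hat{F}_i'/[\hat{F}_i(1 - \hat{F}_i)]$.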

The attraction of Klein and Spady's estimator is that it is fully efficient in the sense
that it attains the semiparametric efficiency bound. Computation is difficult, however.
For details see Section 9.7.4, where similar computational issues are discussed for
Ichimura's WSLS estimator, and see Klein and Spady (1993) and Pagan and Ullah
(1999, pp. 283-285).


14.7.5. Comparison of Semiparametric Estimators 


Econometricians focus on single-index models, and even then there are a multitude of 
semiparametric estimators available for the binary outcome model. None of these esti- 
mators are particularly simple to implement. The objective functions can have multiple 
optima and may not be smooth. For example, Horowitz (1992) uses simulated anneal- 
ing for the smooth maximum score estimator and Dorsey and Mayer (1995) use genetic 
algorithms to obtain the maximum score estimator. 

Interpretation of coefficients is also difficult. For example, the maximum score
estimator applied to the fishing mode data yielded an intercept estimate of 0.776 and a slope
of $-0.631$ (with bootstrap-estimated standard error of 0.103), but these coefficients are
not directly comparable to those given in Table 14.2. Indeed, since slope parameter
estimates are only identified up to scale, the semiparametric estimates are most useful
if several coefficients are included in the regression and coefficient estimates are
compared to those of a reference variable.

The maximum score and maximum rank correlation estimators are unusual among
semiparametric estimators in not requiring use of smoothing parameters, such as
choice of a bandwidth, an attractive property. The latter of these estimators is
$\sqrt{N}$-consistent.

In recent work Blundell and Powell (2004) propose semiparametric estimation with 
endogenous regressors. 


14.8. Derivation of Logit from Type I Extreme Value 


The derivation in Section 14.4.2 of the logit model from the ARUM used knowledge
of the statistical result that the difference $(\varepsilon_0 - \varepsilon_1)$ of independent
type 1 extreme value random variables is logistic distributed. For completeness we provide a
direct derivation based on the distributions of $\varepsilon_0$ and $\varepsilon_1$.

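Rewriting the second line of (14.20) yields

$$\begin{aligned}
\Pr[y = 1] &= \Pr[\varepsilon_0 < \varepsilon_1 + V_1 - V_0] \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\varepsilon_1 + V_1 - V_0} f(\varepsilon_0, \varepsilon_1)\, d\varepsilon_0\, d\varepsilon_1 \\
&= \int_{-\infty}^{\infty} f(\varepsilon_1)\left\{\int_{-\infty}^{\varepsilon_1 + V_1 - V_0} f(\varepsilon_0)\, d\varepsilon_0\right\} d\varepsilon_1,
\end{aligned} \tag{14.31}$$

where in the last line $\varepsilon_0$ and $\varepsilon_1$ are assumed to be independent. By
specializing $f(\varepsilon_0)$ to the type 1 extreme value density, (14.31) becomes

$$\begin{aligned}
\Pr[y = 1] &= \int_{-\infty}^{\infty} f(\varepsilon_1)\left\{\int_{-\infty}^{\varepsilon_1 + V_1 - V_0} e^{-\varepsilon_0}\exp(-e^{-\varepsilon_0})\, d\varepsilon_0\right\} d\varepsilon_1 \\
&= \int_{-\infty}^{\infty} f(\varepsilon_1)\left[\exp(-e^{-\varepsilon_0})\right]_{-\infty}^{\varepsilon_1 + V_1 - V_0} d\varepsilon_1 \\
&= \int_{-\infty}^{\infty} f(\varepsilon_1)\exp\left(-e^{-(\varepsilon_1 + V_1 - V_0)}\right) d\varepsilon_1.
\end{aligned} \tag{14.32}$$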
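Using the extreme value density for $\varepsilon_1$ in (14.32) yields

$$\begin{aligned}
\Pr[y = 1] &= \int_{-\infty}^{\infty} e^{-\varepsilon_1}\exp(-e^{-\varepsilon_1})\exp\left(-e^{-(\varepsilon_1 + V_1 - V_0)}\right) d\varepsilon_1 \\
&= \int_{-\infty}^{\infty} e^{-\varepsilon_1}\exp\left(-e^{-\varepsilon_1} - e^{-(\varepsilon_1 + V_1 - V_0)}\right) d\varepsilon_1 \\
&= \int_{-\infty}^{\infty} e^{-\varepsilon_1}\exp\left(-e^{-\varepsilon_1} - e^{-\varepsilon_1}e^{-(V_1 - V_0)}\right) d\varepsilon_1 \\
&= \int_{-\infty}^{\infty} e^{-\varepsilon_1}\exp\left\{-e^{-\varepsilon_1}\left(1 + e^{-(V_1 - V_0)}\right)\right\} d\varepsilon_1.
\end{aligned} \tag{14.33}$$

Since $\int_{-\infty}^{\infty} a e^{-\varepsilon}\exp(-a e^{-\varepsilon})\, d\varepsilon = 1$ it follows that
$\int_{-\infty}^{\infty} e^{-\varepsilon}\exp(-a e^{-\varepsilon})\, d\varepsilon = 1/a$. Using this
result with $a = 1 + e^{-(V_1 - V_0)}$ in (14.33) yields

$$\begin{aligned}
\Pr[y = 1] &= \left(1 + e^{-(V_1 - V_0)}\right)^{-1} \\
&= e^{V_1}/(e^{V_0} + e^{V_1}) \\
&= e^{V_1 - V_0}/\left(1 + e^{V_1 - V_0}\right).
\end{aligned} \tag{14.34}$$

Letting $V_1 - V_0 = \mathbf{x}'\boldsymbol{\beta}$ yields the logit model.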
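The result can be verified numerically. The sketch below evaluates the integral of the type 1 extreme value density of the second error times the extreme value cdf evaluated at the shifted argument (the last line of (14.32)) by the trapezoid rule and compares it with the logit probability. The integration limits, the number of steps, and the values of the utility difference are choices made for this illustration.

```python
import math

def ev1_density(e):
    """Type 1 extreme value density: exp(-e) * exp(-exp(-e))."""
    return math.exp(-e - math.exp(-e))

def ev1_cdf(e):
    """Type 1 extreme value cdf: exp(-exp(-e))."""
    return math.exp(-math.exp(-e))

def prob_y1_numeric(v_diff, lower=-20.0, upper=40.0, steps=60_000):
    """Trapezoid-rule integral of f(e1) * F(e1 + V1 - V0) over e1."""
    h = (upper - lower) / steps
    total = 0.0
    for k in range(steps + 1):
        e1 = lower + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * ev1_density(e1) * ev1_cdf(e1 + v_diff)
    return total * h

def prob_y1_logit(v_diff):
    """Closed form from (14.34) with v_diff = V1 - V0."""
    return 1.0 / (1.0 + math.exp(-v_diff))

for v_diff in (-1.3, 0.0, 0.7):
    print(v_diff, round(prob_y1_numeric(v_diff), 6), round(prob_y1_logit(v_diff), 6))
```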


14.9. Practical Considerations 


Most packages provide probit and logit model estimators. The main choice for the 
practitioner is which model to use. In practice there is little difference in the predicted 
marginal effects obtained from the two models, unless most of the outcomes are zero 
or most of the outcomes are one. 

Semiparametric estimation generally requires special coding in languages such as
GAUSS, though LIMDEP implements the estimators of Manski and of Klein and Spady.


14.10. Bibliographic Notes 


Logit and probit models are commonly used and relatively simple nonlinear regression models 
that appear in many standard texts such as Greene’s (2003). The surveys by Amemiya (1981) 
and McFadden (1984) include all the basic results. Maddala (1983) and Amemiya (1985) pro- 
vide further details. The books by Train (1986) and Ben-Akiva and Lerman (1985) are particu- 
larly good for applications. These references cover both binary and multinomial outcomes. 


14.3 Bliss (1934) proposed the probit transformation to plot dosage—mortality curves. Berkson 
(1951) popularized use of the simpler logit model. 

14.4 Latent variable models are especially popular in the psychometrics literature. 

14.5 Amemiya (1985, Section 9.5) provides an excellent survey of choice-based sampling for 
binary outcome models. See also Section 24.4. 

14.6 Cameron (1990) considers aggregation in binary outcome models and summarizes general
results of Kelejian (1980) and Stoker (1984) on estimability of individual-level parameters
in nonlinear models using aggregate data.

14.7 The maximum score estimator of Manski (1975) is a leading early example of semipara- 
metric regression. Semiparametric methods for binary outcome models are covered in the 
books by M-J. Lee (1996), Horowitz (1997), and Pagan and Ullah (1999). The last refer- 
ence covers many methods. 
Exercises


14-1 Consider a latent variable modeled by $y_i^* = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i$,
with $\varepsilon_i \sim \mathcal{N}[0, 1]$. Suppose we observe only $y_i = 1$ if $y_i^* < U_i$ and
$y_i = 0$ if $y_i^* > U_i$, where the upper limit $U_i$ is a known constant for each individual
(i.e., data) and may differ over individuals.

(a) Find $\Pr[y_i = 1|\mathbf{x}_i]$. [Hint: Note that this differs from the standard case both
due to presence of $U_i$ and because the inequalities are reversed with $y_i = 1$
if $y_i^* < U_i$.]

(b) Provide details on an estimation method to consistently estimate $\boldsymbol{\beta}$.

(c) Suppose you estimate this model and find that the third regressor $x_{3i}$ has
estimated coefficient $\hat{\beta}_3 = 0.2$. Provide a meaningful interpretation of $\hat{\beta}_3$.


14-2 Consider the logit model with $\Pr[y = 1|x_1, x_2] = \Lambda(\beta_0 + \beta_1 x_1 + \beta_2 x_2)$, where
$\Lambda(z) = e^{z}/(1 + e^{z})$.

(a) Write down the likelihood scores and information matrix in an expanded
form.

(b) Use these to derive Wald and LM score tests of $H_0 : \beta_0 = 0$.

(c) Explain how you would computationally implement the tests.

(d) In what sense is the logit model intrinsically heteroskedastic?


14-3 Suppose we use an index formulation for a discrete choice model but it is felt
that the latent variable is strictly positive. This is accommodated by supposing
that the latent variable $y^*$ has exponential density with parameter $\gamma$, so the
density $f(y^*)$ is $f(y^*) = \gamma^{-1}\exp(-y^*/\gamma)$, with $\gamma = \exp(\mathbf{x}'\boldsymbol{\beta})$.
We observe $y = 1$ if $y^* > \mathbf{z}'\boldsymbol{\alpha}$ and $y = 0$ if $y^* \le \mathbf{z}'\boldsymbol{\alpha}$.

(a) Give the log-likelihood function for the observed data.

(b) What is the effect of a one-unit change in $x_{ji}$ on $\Pr[y_i = 1]$?

(c) Suppose that $y = 1$ if $y^* > \exp(\mathbf{z}'\boldsymbol{\alpha})$ and $\mathbf{x} = \mathbf{z}$. Do you see
any problems in identifying $\boldsymbol{\alpha}$ and/or $\boldsymbol{\beta}$? Explain your answer.


14-4 Consider the maximum score estimator with objective functions $S_N(\boldsymbol{\beta})$ given in
(14.29) and $Q_N(\boldsymbol{\beta})$ given in (14.30).

(a) Show that $S_N(\boldsymbol{\beta}) = \sum_i [\mathbf{1}(y_i = 1) \times \mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} > 0) + \mathbf{1}(y_i = 0) \times \mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} \le 0)]$.

(b) Show that $Q_N(\boldsymbol{\beta}) = \sum_i [\mathbf{1}(y_i = 1) \times \mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} \le 0) + \mathbf{1}(y_i = 0) \times \mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} > 0)]$.

(c) Using $\mathbf{1}(y_i = 1) = 1 - \mathbf{1}(y_i = 0)$, show that $Q_N(\boldsymbol{\beta}) = N - S_N(\boldsymbol{\beta})$.

(d) Using $\mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} \le 0) = 1 - \mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} > 0)$ show that (14.29) can be rewritten as
$S_N(\boldsymbol{\beta}) = \sum_i (2y_i - 1)\mathbf{1}(\mathbf{x}_i'\boldsymbol{\beta} > 0) + N - \sum_i y_i$.


14-5 Use the health expenditure data of Section 16.6. The model is a probit regression
of DMED, an indicator variable for positive health expenditures, against just
one regressor for simplicity, NDISEASE, the number of chronic diseases.

(a) Obtain the OLS estimate of the slope parameter.

(b) Obtain the probit estimate of the slope parameter.

(c) Given part (b), obtain the marginal effect of chronic diseases in two ways:
averaged over the sample and evaluated at the sample average of NDISEASE.

(d) Obtain the logit estimate of the slope parameter.

(e) Given part (d), obtain the marginal effect of chronic diseases in three ways:
averaged over the sample, evaluated at the sample average of NDISEASE,
and evaluated at $\Lambda(\mathbf{x}'\hat{\boldsymbol{\beta}}) = \bar{y}$.

(f) For the logit model calculate the proportionate change in the odds ratio
when NDISEASE changes.


14-6 Continue the analysis of Exercise 14.5.

(a) Compare the three binary models on the basis of statistical significance of
NDISEASE.

(b) Compare the three binary models on the basis of the estimated marginal
effect.

(c) Compare the three binary models on the basis of the predicted probabilities.

(d) Compare the logit and probit binary models on the basis of log-likelihood.
CHAPTER 15


Multinomial Models 


15.1. Introduction 


The preceding chapter considered models for discrete outcome variables that can take 
one of two possible values. Here we consider several possible outcomes, usually mu- 
tually exclusive. Examples include different ways to commute to work (by bus, car, or 
walking), various types of health insurance (fee-for-service, managed care, or none), 
different employment status (full-time, part-time, or none), choice of recreational site, 
occupational choice, and product choice. 

Statistical inference is relatively straightforward in principle, as the data must be
multinomial distributed, just as binary data must be Bernoulli or binomial distributed. 
Estimation is most often by maximum likelihood because the data are clearly multino- 
mial distributed. For some complications, however, moment-based estimation is used 
instead. 

Different multinomial models arise owing to different functional forms for the prob- 
abilities of the multinomial distribution, similar to the differences between probit and 
logit in the binary case. A distinction is also made between models where regressors 
vary across alternatives for a given individual and models where regressors are con- 
stant across alternatives. For example, in transportation mode choice some regressors, 
such as travel time or cost, will vary with choices whereas others, such as age, are 
choice invariant. 

The simplest multinomial model, the conditional or multinomial logit model, is 
quite straightforward to use but is viewed as too restrictive in practice, especially if 
the multinomial outcome data arise from individual choice. For unordered outcomes 
less restrictive models can be obtained using the random utility model. In this model 
the alternative with the highest utility is chosen, where utility from each alternative is 
the sum of deterministic and random components. Different specifications of the ran- 
dom components lead to different functional forms for choice probabilities and hence 
to different multinomial models. Additional models arise in applications where some 
structure can be placed on the decision-making process, such as a natural ordering of 
alternatives or sequencing of decisions. In practice many different multinomial models
are used. 

Section 15.2 presents an application to illustrate the issues discussed in this chap- 
ter. General results for multinomial models are given in Section 15.3. The conditional 
and multinomial logit models are presented in Section 15.4. The additive random util- 
ity model is presented in Section 15.5. The nested logit, random parameters logit, 
and multinomial probit models are the subject of Sections 15.6—15.8. Ordered and se- 
quential models are detailed in Section 15.9. Multivariate models with more than one 
discrete outcome variable are presented in Section 15.10. Semiparametric estimators 
are briefly reviewed in Section 15.11. 


15.2. Example: Choice of Fishing Mode 


This section illustrates multinomial logit, the simplest unordered multinomial model, 
and variations detailed in Section 15.4 that permit regressors to vary across alterna- 
tives. The emphasis is on interpretation of estimated models. The marginal effect of 
a change in a regressor is more complicated than the usual impact on a single condi- 
tional mean. For multinomial data there is instead a separate marginal effect on the 
probability of each outcome, and these marginal effects sum to zero since probabilities 
sum to one. 

The application is to choice of fishing mode. The dependent variable y takes value 
1, 2, 3, or 4 depending on which of the four mutually exclusive alternative modes 
of fishing — respectively, beach, pier, private boat, and charter boat — is chosen. An 
unordered multinomial model such as multinomial logit is appropriate, since there is 
no clear ordering of the outcome variable. Regressors are individual income, which 
does not vary with fishing mode, and price and catch rate, which do vary by fishing 
mode and across individuals. 

The sample of 1,182 people comes from a survey conducted by Thomson and 
Crooke (1991) and analyzed by Herriges and Kling (1999). The data are summarized 
in Table 15.1, which gives averages for the subsamples of people who chose each of 
the modes as well as the overall sample average of regressors. 


15.2.1. Conditional Logit: Alternative-Varying Regressors


First consider the role of price and catch rate, regressors that vary across alternatives 
except that for these data the price of beach and pier fishing are the same. 

Looking down the columns of Table 15.1, we see that people tend to fish where it is 
cheapest for them to do so. For example, for people choosing to fish from the beach the 
average price was $36 compared to average prices of $36, $98, and $125 for the other 
modes. More generally, for people choosing the beach and pier these modes were on 
average much cheaper than the boat modes, and for people fishing from a boat this was 
on average much cheaper than beach or pier fishing. The relationship between mode 
choice and catch rate is less clear-cut, though it is clear that the charter boat has the 
highest catch rate. 
Table 15.1. Fishing Mode Multinomial Choice: Data Summary

                                         Subsample Averages
                               y=1      y=2      y=3      y=4      All y
Explanatory Variable           Beach    Pier     Private  Charter  Overall
Income ($1,000s per month)     4.052    3.387    4.654    3.881    4.099
Price beach ($)                36       31       138      121      103
Price pier ($)                 36       31       138      121      103
Price private ($)              98       82       42       45       55
Price charter ($)              125      110      71       75       84
Catch rate beach               0.28     0.26     0.21     0.25     0.24
Catch rate pier                0.22     0.20     0.13     0.16     0.16
Catch rate private             0.16     0.15     0.18     0.18     0.17
Catch rate charter             0.52     0.50     0.65     0.69     0.63
Sample probability             0.113    0.151    0.354    0.382    1.000
Observations                   134      178      418      452      1182

For alternative-specific regressors that vary across alternatives, such as price and
catch rate, the multinomial logit model is called a conditional logit model (see Section
15.4.1). The probability of the $i$th individual choosing fishing mode $j$ is given by

$$p_{ij} = \Pr[y_i = j] = \frac{\exp(\beta_P P_{ij} + \beta_C C_{ij})}{\sum_{k=1}^{4}\exp(\beta_P P_{ik} + \beta_C C_{ik})}, \qquad j = 1, \ldots, 4,$$

where $P$ denotes price, $C$ denotes catch rate, the subscript $i$ denotes the $i$th individual,
and subscript $j$ or $k$ denotes the alternative. This model is an obvious extension of
binary logit and gives probabilities that lie between 0 and 1 and sum to one. Other
multinomial models use a different functional form for $p_{ij}$.

The coefficient estimates are given in the CL column of Table 15.2. For the CL
model, though not for all multinomial models, the sign of the coefficient is directly
interpretable. Anticipating results from Section 15.4.3, since $\hat{\beta}_P < 0$ we have that an
increase in the price of one alternative decreases the probability of choosing that
alternative and increases the probability of choosing other alternatives. Similarly, since
$\hat{\beta}_C > 0$, an increase in the catch rate for one alternative increases choice probability
for that alternative and decreases the choice probability for other alternatives.

A standard measure of the impact of changes in regressors is
$N^{-1}\sum_{i=1}^{N}\partial p_{ij}/\partial x_{ikr}$, the average marginal response of the
probability of choosing alternative $j$ when the $r$th regressor increases by one unit for
alternative $k$ and is unchanged for the other alternatives. For the CL model this is estimated
by $N^{-1}\sum_{i=1}^{N}\hat{p}_{ij}(\delta_{ijk} - \hat{p}_{ik})\hat{\beta}_r$ (see (15.18)), where
$\delta_{ijk}$ equals one if $j = k$ and zero otherwise, $\hat{\boldsymbol{\beta}}$ is the estimate of
$\boldsymbol{\beta}$, and $\hat{p}_{ij}$, $j = 1, \ldots, m$, are the predicted probabilities.

The average responses across the four modes for the two regressors price and catch
rate are given in Table 15.3. The table gives the effect on choice probability of a
100-unit (or $100) change in price and the effect of a one-unit change in the catch rate. For
example, an increase of $100 in the price of beach fishing leads to a decrease of 0.272
in the probability of fishing from a beach and an increase of 0.119, 0.080, and 0.068,
respectively, in the probability of fishing from a pier, a private boat, and a charter boat.
Note that the changes in probabilities sum to zero, as expected.


Table 15.2. Fishing Mode Multinomial Choice: Logit Estimates^a

                                                     Model
Regressor        Type        Coefficient             CL        MNL       Mixed
Price (P)        Specific    β_P                     −0.021    -         −0.025
Catch rate (C)   Specific    β_C                     0.953     -         0.358
Intercept        Invariant   α_1 : Beach             -         0.0       0.0
                             α_2 : Pier              -         0.814     0.778
                             α_3 : Private           -         0.739     0.527
                             α_4 : Charter           -         1.341     1.694
Income (I)       Invariant   β_I1 : Beach            -         0.0       0.0
                             β_I2 : Pier             -         −0.143    −0.128
                             β_I3 : Private          -         0.092     0.089
                             β_I4 : Charter          -         −0.032    −0.033
ln L                                                 −1311     −1477     −1215
Pseudo-R²                                            0.162     0.099     0.258

^a Type of regressor is alternative-specific (price and catch rate) or alternative-invariant (income). Outcomes are
(1) beach, (2) pier, (3) private, and (4) charter. MLE estimates are for conditional logit (CL), multinomial logit
(MNL), and mixed logit (Mixed) models. MNL and Mixed models are normalized to base category beach. A hyphen
denotes a coefficient that is not included in the model. All estimates except that for β_I4 are statistically
significant at 5%.

Calculation of these marginal effects and probabilities requires postestimation
computation. A back-of-the-envelope calculation uses $\bar{p}_j(\delta_{jk} - \bar{p}_k)\hat{\beta}_r$
for the CL model, where $\bar{p}_j$ is the sample average probability. For the effect of a $100
change in the price of beach fishing on the probability of beach fishing this yields
$100 \times 0.113(1 - 0.113) \times (-0.021) = -0.21$, compared to the sample average value of
$-0.272$ in the table. This approximation becomes less reasonable as probabilities get closer
to 0 or 1.
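The postestimation calculation can be sketched in a few lines. The code below uses the CL coefficient estimates reported in Table 15.2 and, purely for illustration, evaluates probabilities and marginal effects for a single hypothetical individual whose prices and catch rates equal the overall sample averages in Table 15.1. This is not the same as averaging individual-level effects over the sample, so the numbers will not reproduce Table 15.3. The last lines reproduce the back-of-the-envelope calculation from the text.

```python
import math

BETA_P, BETA_C = -0.021, 0.953  # CL estimates reported in Table 15.2
MODES = ["beach", "pier", "private", "charter"]

def cl_probabilities(prices, catch_rates):
    """Conditional logit probabilities for one individual."""
    v = [BETA_P * p + BETA_C * c for p, c in zip(prices, catch_rates)]
    v_max = max(v)  # subtract the maximum for numerical stability
    exp_v = [math.exp(vj - v_max) for vj in v]
    total = sum(exp_v)
    return [e / total for e in exp_v]

def cl_marginal_effects(probs, beta):
    """Matrix whose [j][k] entry is p_j * (delta_jk - p_k) * beta.

    Entry [j][k] is the change in Pr[alternative j] when the regressor
    for alternative k rises by one unit.
    """
    m = len(probs)
    return [[probs[j] * ((1.0 if j == k else 0.0) - probs[k]) * beta
             for k in range(m)] for j in range(m)]

# Overall sample averages from Table 15.1, treated as one hypothetical individual.
prices = [103.0, 103.0, 55.0, 84.0]
catch_rates = [0.24, 0.16, 0.17, 0.63]

probs = cl_probabilities(prices, catch_rates)
effects_price = cl_marginal_effects(probs, 100.0 * BETA_P)  # $100 price change

for mode, p, row in zip(MODES, probs, effects_price):
    print(mode, round(p, 3), [round(e, 3) for e in row])

# Back-of-the-envelope calculation using the sample average beach probability.
p_beach = 0.113
approx = 100.0 * p_beach * (1.0 - p_beach) * BETA_P
print(round(approx, 2))
```

Each column of the marginal-effects matrix sums to zero because the probabilities sum to one.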

The results in Table 15.3 are consistent with the view that the greatest substitution
is between pier and beach fishing and between private boat and charter boat
fishing. Specifically, price increases, or catch rate decreases, for pier lead to
substitution to beach, and vice versa. A similar result holds for charter versus private
boat.


Table 15.3. Fishing Mode Choice: Marginal Effects for Conditional Logit Model^a

                             $100 Change in Price of              One-Unit Change in Catch Rate for
                         Beach    Pier     Private  Charter    Beach    Pier     Private  Charter
Change in Pr[beach]      −.272    .119     .085     .068       .126     −.055    −.040    −.032
Change in Pr[pier]       .119     −.263    .080     .064       −.055    .122     −.037    −.030
Change in Pr[private]    .080     .080     −.391    .225       −.040    −.037    .182     −.105
Change in Pr[charter]    .068     .064     .225     −.357      −.032    −.030    −.105    .166

^a Average marginal response of the probability of choosing each alternative when a regressor changes for one of
the alternatives and is unchanged for the other alternatives.

These choice probability changes are for large changes in the regressors, given that 
average price is $86 and average catch rate is 0.30. One can instead calculate elastic- 
ities. Elasticities for choice probabilities need to be used with care, however, because 
probabilities are bounded between 0 and 1. A change in predicted probability from 
0.01 to 0.02 will lead to an elasticity roughly 50 times larger than that for a change in 
predicted probability from 0.50 to 0.51. 
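The factor of roughly 50 follows from simple arithmetic on proportionate changes, as the short sketch below shows. It computes only the proportionate change in the probability, which is the numerator of an elasticity for a given proportionate change in the regressor; the function name is introduced here for illustration.

```python
def proportionate_change(p_old, p_new):
    """Proportionate change in a probability, (p_new - p_old) / p_old."""
    return (p_new - p_old) / p_old

low = proportionate_change(0.01, 0.02)   # probability doubles
high = proportionate_change(0.50, 0.51)  # probability rises by 2 percent
ratio = low / high

print(round(low, 4), round(high, 4), round(ratio, 1))
```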


15.2.2. Multinomial Logit: Alternative-Invariant Regressors 


Now consider the role of income, measured as monthly income in thousands of dollars. 
From Table 15.1 it appears that as income rises the fishing mode moves progressively 
from pier, where average monthly income of people fishing at a pier is $3,387, to 
charter boat to beach and finally to private boat, where the average income is $4,654. 

Because income is invariant across alternatives the appropriate model is the
multinomial logit model (presented in Section 15.4.1). This lets regressor coefficients vary
across alternatives, with

$$p_{ij} = \Pr[y_i = j] = \frac{\exp(\alpha_j + \beta_{Ij} I_i)}{\sum_{k=1}^{4}\exp(\alpha_k + \beta_{Ik} I_i)}, \qquad j = 1, \ldots, 4,$$

where $I$ denotes income. A normalization of parameters is needed as a consequence
of the restriction that probabilities sum to one. The empirical results set $\alpha_1 = 0$ and
$\beta_{I1} = 0$.

The parameter estimates are given in the MNL column of Table 15.2. Coefficient
interpretation is more difficult than for the CL model. In particular, for MNL
models a positive regression parameter does not mean that an increase in the regressor
leads to an increase in the probability of that alternative. Instead, interpretation for
the MNL model is relative to the reference or base category group, here beach as
the beach coefficients were normalized to zero. Compared to beach fishing a higher
income leads to reduced likelihood of fishing from a pier (since $\hat{\beta}_{I2} = -0.143 < 0$)
or a charter boat (since $\hat{\beta}_{I4} = -0.032$) and greater likelihood of use of a private boat
(since $\hat{\beta}_{I3} = 0.092$).

The magnitude of the response to income changes can be measured using
$N^{-1}\sum_{i=1}^{N}\partial p_{ij}/\partial I_i$, the marginal effect averaged over individuals. For
the MNL models this is estimated by $N^{-1}\sum_{i=1}^{N}\hat{p}_{ij}(\hat{\beta}_j - \bar{\beta}_i)$
(see (15.19)), where $\hat{\beta}_j$ is the estimate of $\beta_j$,
$\bar{\beta}_i = \sum_{l}\hat{p}_{il}\hat{\beta}_l$ is a probability weighted average of the
$\hat{\beta}_l$, and $\hat{p}_{ij}$, $j = 1, \ldots, m$, are the predicted probabilities. For the four
choices a $1,000 increase in monthly income is associated with changes of 0.000, $-0.021$,
0.033, and $-0.012$ in, respectively, the probabilities of fishing from beach, pier, private
boat, and charter boat. This indicates little change in beach fishing, movement out of pier and
charter boat fishing, and movement to private boat fishing. Since average monthly income is
$4,100 the changes in probability are of reasonable size.

However, income alone is not a great discriminator for fishing mode choice. From
the bottom of Table 15.2, we see that the MNL model has much lower log-likelihood
and pseudo-R² than does the CL model. From output not given, across all individuals
in the sample the predicted probabilities from the MNL model range from 0.095 to 
0.115 for beach, 0.036 to 0.234 for pier, 0.240 to 0.626 for private boat, and 0.244 to 
0.416 for charter boat. Since an intercept is included in the MNL model the averages 
of these predicted probabilities for each choice equal the sample average probabilities. 
This result for the MNL model is a consequence of (15.16) given later. 


15.2.3. Mixed Logit 


A richer model combines the two preceding models. This is done using a so-called
mixed logit model (see Section 15.4.1) with

\[
\Pr[y_i = j] = \frac{\exp(\beta_P P_{ij} + \beta_C C_{ij} + \alpha_j + \beta_{Ij} I_i)}{\sum_{k=1}^{4} \exp(\beta_P P_{ik} + \beta_C C_{ik} + \alpha_k + \beta_{Ik} I_i)}.
\]

This model, not to be confused with the model of Section 15.7 which is also referred
to as a mixed model, can be implemented as a conditional logit model

\[
\Pr[y_i = j] = \frac{\exp\left(\beta_P P_{ij} + \beta_C C_{ij} + \sum_{l=1}^{4} (\alpha_l d_{ijl} + \beta_{Il}\, dI_{ijl})\right)}{\sum_{k=1}^{4} \exp\left(\beta_P P_{ik} + \beta_C C_{ik} + \sum_{l=1}^{4} (\alpha_l d_{ikl} + \beta_{Il}\, dI_{ikl})\right)},
\]

where \(d_{ijl}\) is a dummy variable equal to one if \(j = l\) and equal to zero otherwise,
and \(dI_{ijl} = d_{ijl} I_i\) is equal to income if \(j = l\) and equals zero otherwise. In this case
we regress \(y_i\) on eight regressors: \(P_{ij}\), \(C_{ij}\), \(d_{ij2}\), \(d_{ij3}\), \(d_{ij4}\), \(dI_{ij2}\), \(dI_{ij3}\), and \(dI_{ij4}\).
Since \(\alpha_1 = 0\) and \(\beta_{I1} = 0\) the regressors \(d_{ij1}\) and \(dI_{ij1}\) are omitted. Note that if we
estimate this CL model with just the \(d_{ijl}\) and \(dI_{ijl}\) as regressors then the CL estimates
equal the MNL estimates given earlier. An MNL model can always be estimated as a
CL model (see Section 15.3.4).

While the mixed logit model is richer than the CL model, the CL model has the ad- 
vantage that if an additional alternative was added to the choice set then one can predict 
its probability of selection, since the parameters of the CL model do not vary across 
alternatives. 

The results are reported in the last column of Table 15.2. Compared to the first 
two models the coefficients are little changed, except for considerable change in the 
catch rate coefficient. This change is due to inclusion of the alternative-specific dum- 
mies, rather than inclusion of income. The mixed model is strongly preferred to the 
other models on the basis of much higher log-likelihood value or formal statistical 
tests. 




15.3. General Results 


The results in this section pertain to all multinomial models. The remainder of the 
chapter specializes to the many different specifications of the multinomial model used 
in practice. 


15.3.1. Multinomial Models


There are \(m\) alternatives and the dependent variable \(y\) is defined to take value \(j\) if the
\(j\)th alternative is taken, \(j = 1, \ldots, m\). (Some authors instead consider \(m + 1\)
alternatives with \(j = 0, 1, \ldots, m\).) Define the probability that alternative \(j\) is chosen as

\[
p_j = \Pr[y = j], \quad j = 1, \ldots, m. \tag{15.1}
\]

Introduce \(m\) binary variables for each observation \(y\),

\[
y_j = \begin{cases} 1 & \text{if } y = j, \\ 0 & \text{if } y \neq j. \end{cases} \tag{15.2}
\]

Thus \(y_j\) equals one if alternative \(j\) is the observed outcome and the remaining \(y_k\) equal
zero, so for each observation on \(y\) exactly one of \(y_1, y_2, \ldots, y_m\) will be nonzero. The
multinomial density for one observation can then be conveniently written as

\[
f(y) = p_1^{y_1} \times \cdots \times p_m^{y_m} = \prod_{j=1}^{m} p_j^{y_j}. \tag{15.3}
\]


For regression models introduce a subscript \(i\) for the \(i\)th individual and regressors
\(x_i\). Specify a model for the probability that individual \(i\) chooses the \(j\)th alternative,

\[
p_{ij} = \Pr[y_i = j] = F_j(x_i, \beta), \quad j = 1, \ldots, m, \quad i = 1, \ldots, N. \tag{15.4}
\]

The functional form for \(F_j\) should be such that probabilities lie between 0 and 1 and
sum over \(j\) to one. Different functional specifications for \(F_j\) correspond to specific
models, notably multinomial logit, nested logit, multinomial probit, ordered, sequential,
and multivariate models. These models are presented in subsequent sections.


15.3.2. ML Estimation 


The multinomial density for one observation is given in (15.3). The likelihood function
for a sample of \(N\) independent observations is then \(L_N = \prod_{i=1}^{N} \prod_{j=1}^{m} p_{ij}^{y_{ij}}\), where the
subscript \(i\) denotes the \(i\)th of \(N\) individuals and the subscript \(j\) denotes the \(j\)th of \(m\)
alternatives. The log-likelihood function is

\[
\mathcal{L} = \ln L_N = \sum_{i=1}^{N} \sum_{j=1}^{m} y_{ij} \ln p_{ij}, \tag{15.5}
\]

where \(p_{ij} = F_j(x_i, \beta)\) is a function of parameters \(\beta\) and regressors, defined in (15.4).
More generally, the number of alternatives may vary across different individuals, so
that \(m\) choices become \(m_i\) choices.

The first-order conditions for the MLE \(\hat\beta\) are that it solves

\[
\frac{\partial \mathcal{L}}{\partial \beta} = \sum_{i=1}^{N} \sum_{j=1}^{m} \frac{y_{ij}}{p_{ij}} \frac{\partial p_{ij}}{\partial \beta} = 0, \tag{15.6}
\]

which is usually nonlinear in \(\beta\). The distribution of \(y_i\) is necessarily multinomial, so
correct specification of the dgp means correct specification of the functional forms
\(F_j(x_i, \beta)\) for the probabilities \(p_{ij}\). This ensures consistency as then \(E[y_{ij}] = p_{ij}\),
so taking the expectation of (15.6) yields \(E[\partial \mathcal{L}/\partial \beta] = \sum_{i=1}^{N} \sum_{j=1}^{m} \partial p_{ij}/\partial \beta\), which
equals zero since \(\sum_{j=1}^{m} p_{ij} = 1\).

The usual asymptotic theory applies and the variance matrix is the inverse of
the information matrix. Differentiating the double sum in (15.6) with respect to \(\beta'\) and
using \(E[y_{ij}] = p_{ij}\) yields upon simplification

\[
\hat\beta \overset{a}{\sim} \mathcal{N}\left[\beta, \left(\sum_{i=1}^{N} \sum_{j=1}^{m} \frac{1}{p_{ij}} \frac{\partial p_{ij}}{\partial \beta} \frac{\partial p_{ij}}{\partial \beta'}\right)^{-1}\right]. \tag{15.7}
\]

Provided observations are independent over \(i\) there is no need to use more general
sandwich forms of the variance matrix since the data are definitely multinomial
distributed and the information matrix equality will hold.

As already mentioned, different models correspond to different choices of \(F_j(x_i, \beta)\)
for \(p_{ij}\) and hence different expressions in (15.6) and (15.7).

Maximum likelihood estimation for choice-based samples, such as those that
oversample infrequently observed outcomes, is presented in Sections 14.5 and 24.4.


15.3.3. Moment-Based Estimation 


For simple cross-section applications the standard estimation procedure is the MLE. 

However, when complications such as endogeneity or correlation across observational
units \(i\) arise, it can be more convenient to instead use moment-based estimators.
Assuming the probabilities are correctly specified, we can consider any estimator with
estimating equations

\[
\sum_{i=1}^{N} \sum_{j=1}^{m} (y_{ij} - p_{ij}) z_{ij} = 0, \tag{15.8}
\]

where \(z_{ij}\), a vector of the same dimension as \(\beta\), does not depend on \(y_{ij}\), for example,
\(z_{ij} = \partial p_{ij}/\partial \beta\). The weights must vary with \(j\), since \(\sum_{j=1}^{m} (y_{ij} - p_{ij}) = 0\) holds
identically. This estimator will be consistent if the functional form for \(p_{ij}\) is correctly
specified, as then \(E[y_{ij}] = p_{ij}\) and the double sum on the left-hand side of (15.8)
has expected value zero. The efficiency of the estimator will vary with the choice of \(z_{ij}\)
and in the most general case GMM estimation procedures can be used. The estimating
equations (15.8) are the basis for the method of simulated moments estimator for the
multinomial probit model (see Section 15.8.2).


15.3.4. Alternative-Varying Regressors


Multinomial regression models differ not only in the choice of function \(F_j(\cdot)\) in (15.4)
but also in how regressors and parameters vary across the alternatives.

At one extreme all regressors may be alternative-varying, meaning that they take
different values for different alternatives. Let \(x_{ij}\) denote the value of the regressors for
individual \(i\) and alternative \(j\), and let \(x_i = [x_{i1}' \; x_{i2}' \cdots x_{im}']'\). Then (15.4) is usually of
the form

\[
F_j(x_i, \beta) = F_j(x_{i1}'\beta, \ldots, x_{im}'\beta),
\]

where the parameters \(\beta\) are constant across alternatives. An example is the conditional
logit model defined later in (15.10).

At the other extreme all regressors may be alternative-invariant, meaning that
\(x_i\) does not vary across alternatives. An example is individual socioeconomic
characteristics in a model of transportation mode choice. Then (15.4) is usually of the
form

\[
F_j(x_i, \beta) = F_j(x_i'\beta_1, \ldots, x_i'\beta_m),
\]

where the parameters \(\beta_j\) differ across alternatives and \(\beta = [\beta_1' \; \beta_2' \cdots \beta_m']'\). Parameter
identification requires a normalization such as \(\beta_1 = 0\). An example is the multinomial
logit model defined later in (15.11).

The distinction between alternative-varying and alternative-invariant regressors is 
of practical importance, as standard notation and computer programs for multinomial 
models work exclusively with one or the other. In practice, of course, some regressors 
may be alternative-varying and others alternative-invariant. In such cases it is best to 
use a program written for alternative-varying regressors, as it is possible to go from 
alternative-invariant regressors to the alternative-varying format. Let \(x_i\) be a \(K \times 1\)
vector. Then define \(x_{ij}\) to be a \(Km \times 1\) vector with zeros everywhere except that the
\(j\)th block is \(x_i\), that is,

\[
x_{ij} = [0' \cdots 0' \; x_i' \; 0' \cdots 0']',
\]

and define \(\beta = [0' \; \beta_2' \cdots \beta_m']'\), where \(\beta_1 = 0\) is a normalization. Then \(x_i'\beta_j = x_{ij}'\beta\).
The regressors are essentially included as interactions with alternative-specific
dummies. An example was given in Section 15.2.3. It is also possible to go from the
alternative-specific to the alternative-invariant format, but then \((m - 1)\) parameter
equality constraints need to be imposed for each of the alternative-specific regressors.
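
The stacking just described can be illustrated numerically. The following is a minimal sketch in Python with NumPy; the helper name `stack_invariant` and the simulated values are assumptions made only for illustration. It builds the \(Km \times 1\) vectors \(x_{ij}\) from a \(K \times 1\) vector \(x_i\) and compares \(x_i'\beta_j\) with \(x_{ij}'\beta\) for every alternative.

```python
import numpy as np

def stack_invariant(x_i, m):
    """Return an (m, K*m) array whose j-th row is x_ij = [0' ... x_i' ... 0']'."""
    K = x_i.shape[0]
    X = np.zeros((m, K * m))
    for j in range(m):
        X[j, j * K:(j + 1) * K] = x_i   # j-th block holds x_i, zeros elsewhere
    return X

rng = np.random.default_rng(0)
K, m = 2, 4                              # assumed dimensions
x_i = rng.normal(size=K)                 # alternative-invariant regressors
# Row j holds beta_j; the first row is the normalization beta_1 = 0.
beta_j = np.vstack([np.zeros(K), rng.normal(size=(m - 1, K))])
beta = beta_j.reshape(-1)                # stacked [0' beta_2' ... beta_m']'

X_stacked = stack_invariant(x_i, m)
index_mnl = beta_j @ x_i                 # x_i' beta_j for j = 1, ..., m
index_cl = X_stacked @ beta              # x_ij' beta for j = 1, ..., m
print(np.allclose(index_mnl, index_cl))
```

Because the two index vectors coincide, a program written for alternative-varying regressors can fit the MNL model once the data are expanded in this way.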


15.3.5. Revealed Preference and Stated Preference Data 


The multinomial data used in microeconometric studies often arise from individual 
consumer choice. Consumer choice data may be either revealed preference data, 
which are data on actual decisions and outcomes, or stated preference data, which 
are survey responses to hypothetical questions. An example of revealed preference data 
would be actual occupational choice. An example of stated preference data would be 
a marketing study for fuel-efficient vehicles that asks a respondent to choose among 
various hypothetical vehicles that differ in characteristics such as fuel consumption, 
range, and price. 

Revealed preference data often provide little or no data on alternatives other than 
that chosen. For example, we may know the price to an individual consumer of the 
chosen product but not the prices of alternative products. The attraction of stated pref- 
erence data for multinomial modeling is that data are available on key variables such 
as price for all possible alternatives. This is particularly advantageous if one wishes to 
predict the probability of choice or market share of a new alternative on the basis of 
characteristics of the new alternative, as all parameters can be alternative-invariant if 
all regressors vary across alternatives. 

There is some controversy in using stated preference data, because responses can
vary with the wording of questions. Moreover, people may overstate or understate their 
willingness to pay to support particular policies. For example, some might overstate 
their willingness to support an environmentally friendly policy. 

Shopping scanner data are especially attractive because they provide data on re- 
vealed choice while at the same time data on prices across all alternatives are also 
provided. 


15.3.6. Model Evaluation and Selection 


Regression parameters in multinomial models can be difficult to directly interpret. 
Instead, it is useful to consider the marginal effect (or elasticities) of changes 
in regressors on outcome probabilities. Formulas for conditional and multinomial 
logit models are given in Section 15.4.3 and have been used in the Section 15.2 
application. 

Several model evaluation methods are presented in Amemiya (1981) and Maddala
(1983). Using R² measures based on the analogue of squared residuals does not
work well. Comparisons of predicted probabilities with actual outcomes are of limited
value as MNL models estimated with an intercept impose in estimation the restriction
that the average of the predicted probabilities equals the sample average
probabilities for each alternative. It can be useful to look at the range of the in-sample
fitted probabilities for each alternative. The wider the range the more discriminating
is the model. For more detail see the discussion in Section 14.3.7 for binary
outcomes.

Multinomial models are usually estimated by maximum likelihood. Thus to the 
extent that models are nested one can use standard likelihood ratio tests. When models 
are nonnested one can use variants of Akaike’s information criteria based on the fitted 
log-likelihood with a degrees-of-freedom adjustment for the number of parameters in 
the model (see Section 8.5.1). 

A useful pseudo-R² measure, due to McFadden (1973), is

\[
R^2 = 1 - \ln L_{\text{fit}} / \ln L_0, \tag{15.9}
\]

where \(L_{\text{fit}}\) denotes the likelihood of the fitted model and \(L_0\) denotes that of an
intercept-only model that estimates the probability of each alternative to be the sample
average. For any multinomial model the theoretical maximum value of the log-likelihood
is zero. This arises if \(p_{ij} = 1\) when \(y_{ij} = 1\) and \(p_{ij} = 0\) otherwise, for all \(i\) and \(j\).
Thus the R² measure can be rewritten as

\[
R^2 = \frac{\ln L_{\text{fit}} - \ln L_0}{\ln L_{\max} - \ln L_0}.
\]

This can be interpreted as the fraction of the maximum potential gain in log-likelihood
that is achieved by the fitted model (see Section 8.7.1).
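
As a small illustration of (15.9), the sketch below computes the pseudo-R² from a matrix of fitted probabilities. The function names `log_likelihood` and `mcfadden_r2` and the toy data are hypothetical; the intercept-only benchmark uses the sample shares, as described above.

```python
import numpy as np

def log_likelihood(p, y):
    """ln L = sum_i ln p_{i, y_i}; y holds the chosen alternative, coded 0..m-1."""
    return float(np.sum(np.log(p[np.arange(len(y)), y])))

def mcfadden_r2(p_fit, y):
    """Pseudo-R^2 of (15.9): 1 - ln L_fit / ln L_0."""
    m = p_fit.shape[1]
    shares = np.bincount(y, minlength=m) / len(y)   # intercept-only probabilities
    ll_fit = log_likelihood(p_fit, y)
    ll_0 = float(np.sum(np.log(shares[y])))
    return 1.0 - ll_fit / ll_0

# Toy data (assumed): six individuals, three alternatives.
y = np.array([0, 1, 2, 2, 1, 0])
p_null = np.tile(np.bincount(y, minlength=3) / len(y), (len(y), 1))
p_good = np.full((6, 3), 0.05)
p_good[np.arange(6), y] = 0.90           # a model that predicts the outcomes well

r2_null = mcfadden_r2(p_null, y)         # intercept-only fit gives R^2 = 0
r2_good = mcfadden_r2(p_good, y)         # approaches 1 as the fit improves
print(r2_null, r2_good)
```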


15.4. Multinomial Logit


The simplest multinomial model is the multinomial logit model, proposed by Luce 
(1959). The commonly used variants of this model differ according to whether or not 
regressors vary across alternatives. Many of the issues presented in this section pertain 
to other models presented more briefly in subsequent sections. 


15.4.1. Conditional, Multinomial, and Mixed Logit Models 


For alternative-varying regressors (see Section 15.3.4) the conditional logit model is
used. The CL model specifies

\[
p_{ij} = \frac{\exp(x_{ij}'\beta)}{\sum_{l=1}^{m} \exp(x_{il}'\beta)}, \quad j = 1, \ldots, m. \tag{15.10}
\]

Since \(\exp(x_{ij}'\beta) > 0\) these probabilities lie between 0 and 1 and sum over \(j\) to one.
Indeed, once one has seen the formula (15.10) it appears to be the simplest
specification that ensures well-behaved probabilities. Because \(\sum_{j=1}^{m} p_{ij} = 1\) an equivalent
model is obtained by defining \(x_{ij}\) to be deviations of regressors from values of
alternative 1, say, and setting \(x_{i1} = 0\).

When instead the regressors do not vary over alternatives, the multinomial logit
model is used. The MNL model specifies

\[
p_{ij} = \frac{\exp(x_i'\beta_j)}{\sum_{l=1}^{m} \exp(x_i'\beta_l)}, \quad j = 1, \ldots, m. \tag{15.11}
\]

Because \(\sum_{j=1}^{m} p_{ij} = 1\), a restriction is needed to ensure model identification and the
usual restriction is that \(\beta_1 = 0\).

The two models can be combined into what some authors call a mixed logit model,
with

\[
p_{ij} = \frac{\exp(x_{ij}'\beta + w_i'\gamma_j)}{\sum_{l=1}^{m} \exp(x_{il}'\beta + w_i'\gamma_l)}, \quad j = 1, \ldots, m, \tag{15.12}
\]

where \(x_{ij}\) vary over alternatives and \(w_i\) do not vary over alternatives. As discussed
in Sections 15.2.3 and 15.3.4, the mixed and MNL models can be reexpressed as a
CL model. Note that the term mixed logit model is also sometimes used for a quite
different model detailed in Section 15.7.

All these models can be given the general label of multinomial logit, but we follow 
the standard convention in distinguishing between the MNL and CL models. 
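
The CL and MNL probabilities in (15.10) and (15.11) can be sketched in a few lines of Python with NumPy. The function names and simulated inputs below are illustrative assumptions. The example also checks two claims from the text: CL probabilities are unchanged when regressors are measured as deviations from alternative 1, and MNL probabilities are unchanged when the same vector is added to every \(\beta_j\), which is why a normalization is required.

```python
import numpy as np

def softmax(v):
    """exp(v_j) / sum_l exp(v_l) along the last axis, shifted for stability."""
    v = v - v.max(axis=-1, keepdims=True)
    e = np.exp(v)
    return e / e.sum(axis=-1, keepdims=True)

def cl_probs(X, beta):
    """CL probabilities (15.10). X: (N, m, K) regressors x_ij; beta: (K,)."""
    return softmax(X @ beta)

def mnl_probs(x, B):
    """MNL probabilities (15.11). x: (N, K) regressors x_i; B: (m, K), row j is beta_j."""
    return softmax(x @ B.T)

rng = np.random.default_rng(0)
N, m, K = 5, 3, 2                                   # assumed dimensions
X = rng.normal(size=(N, m, K))
beta = np.array([0.8, -0.5])
p_cl = cl_probs(X, beta)
p_cl_dev = cl_probs(X - X[:, [0], :], beta)         # deviations from alternative 1

x = rng.normal(size=(N, K))
B = np.vstack([np.zeros(K), rng.normal(size=(m - 1, K))])   # beta_1 = 0
p_mnl = mnl_probs(x, B)
p_mnl_shift = mnl_probs(x, B + np.array([1.0, -2.0]))       # same shift for all beta_j

print(np.allclose(p_cl, p_cl_dev), np.allclose(p_mnl, p_mnl_shift))
```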

An obvious generalization of the multinomial logit model is 

Pij = m > j=l,...,m, (15.13) 

where V;; > 0 can be quite general functions of regressors x; and parameters 3. This 
is called the universal logit model. Although this can generate a potentially rich class 
of models it is seldom used in econometrics as it does not arise naturally from choice 
theory. 


15.4.2. ML Estimation of CL and MNL Models


We present key formulas for the conditional logit and multinomial logit models. Com- 
plete derivations are given in Section 15.12. 

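For the CL model, where \(p_{ij}\) is defined in (15.10), \(\partial p_{ij}/\partial \beta = p_{ij}(x_{ij} - \bar{x}_i)\), where
\(\bar{x}_i = \sum_{l=1}^{m} p_{il} x_{il}\) is a probability weighted average of the regressors (see Section
15.12.1). The CL first-order conditions, given in (15.6) for general \(p_{ij}\), simplify
immediately to

\[
\sum_{i=1}^{N} \sum_{j=1}^{m} y_{ij} (x_{ij} - \bar{x}_i) = 0. \tag{15.14}
\]

Differentiating with respect to \(\beta'\), using \(E[y_{ij}] = p_{ij}\), and performing some further
algebra yields

\[
\hat\beta_{CL} \overset{a}{\sim} \mathcal{N}\left[\beta, \left(\sum_{i=1}^{N} \sum_{j=1}^{m} p_{ij} (x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)'\right)^{-1}\right]. \tag{15.15}
\]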

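For the MNL model, \(p_{ij}\) is defined in (15.11) and it is shown in Section 15.12.2 that
\(\partial p_{ij}/\partial \beta_k = p_{ij}(\delta_{ijk} - p_{ik}) x_i\), where \(\delta_{ijk}\) is an indicator variable equal to 1 if \(j = k\)
and equal to 0 if \(j \neq k\), and that the resultant MNL first-order conditions simplify
after some algebra to

\[
\sum_{i=1}^{N} (y_{ik} - p_{ik}) x_i = 0, \quad k = 1, \ldots, m. \tag{15.16}
\]

As usual \(\hat\beta_{MNL} \overset{a}{\sim} \mathcal{N}[\beta, -(E[\partial^2 \mathcal{L}/\partial\beta \partial\beta'])^{-1}]\), where further algebra shows that the
information matrix has \(jk\)th block

\[
-E\left[\frac{\partial^2 \mathcal{L}}{\partial \beta_j \partial \beta_k'}\right] = \sum_{i=1}^{N} p_{ij} (\delta_{ijk} - p_{ik}) x_i x_i', \quad j = 1, \ldots, m, \quad k = 1, \ldots, m. \tag{15.17}
\]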
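
To make (15.16) and (15.17) concrete, here is a minimal Newton-Raphson sketch for the MNL model with the normalization \(\beta_1 = 0\). It is an illustration under stated assumptions rather than a production estimator: the function names, the simulated data, and the true parameter values are all hypothetical, and no step-size safeguards are included. The score uses (15.16) and the blocks of the information matrix use (15.17).

```python
import numpy as np

def mnl_probs(X, B):
    """MNL probabilities; X is (N, K), B is (m, K) with row j equal to beta_j."""
    V = X @ B.T
    V = V - V.max(axis=1, keepdims=True)
    P = np.exp(V)
    return P / P.sum(axis=1, keepdims=True)

def fit_mnl(X, y, m, max_iter=100, tol=1e-10):
    """Newton-Raphson MLE with beta_1 = 0; y is coded 0, ..., m-1."""
    N, K = X.shape
    Y = np.eye(m)[y]                       # indicators y_ij
    B = np.zeros((m, K))
    for _ in range(max_iter):
        P = mnl_probs(X, B)
        # Score for the free blocks, from (15.16): sum_i (y_ik - p_ik) x_i.
        score = ((Y - P)[:, 1:].T @ X).reshape(-1)
        # Information matrix from (15.17): block (j, k) is
        # sum_i p_ij (delta_jk - p_ik) x_i x_i'.
        info = np.zeros(((m - 1) * K, (m - 1) * K))
        for j in range(1, m):
            for k in range(1, m):
                w = P[:, j] * (float(j == k) - P[:, k])
                info[(j - 1) * K:j * K, (k - 1) * K:k * K] = (X * w[:, None]).T @ X
        step = np.linalg.solve(info, score)
        B[1:] += step.reshape(m - 1, K)
        if np.max(np.abs(step)) < tol:
            break
    # Variance estimate: inverse information from the final Newton iteration.
    return B, np.linalg.inv(info)

# Simulated data (assumed values): an intercept and one standardized regressor.
rng = np.random.default_rng(42)
N, m = 2000, 3
X = np.column_stack([np.ones(N), rng.normal(size=N)])
B_true = np.array([[0.0, 0.0], [0.5, -1.0], [-0.3, 0.8]])
y = np.argmax(X @ B_true.T + rng.gumbel(size=(N, m)), axis=1)

B_hat, V_hat = fit_mnl(X, y, m)
P_hat = mnl_probs(X, B_hat)
shares = np.bincount(y, minlength=m) / N
print(P_hat.mean(axis=0), shares)
```

Because the model includes an intercept, the first-order conditions (15.16) force the average predicted probability for each alternative to equal its sample share, which the final line displays.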


15.4.3. Regression Parameter Interpretation 


Care is needed in the interpretation of parameters in any nonlinear model. This is
particularly so for multinomial models where, for example, there is not necessarily a
one-to-one correspondence between the sign of a coefficient and the sign of its effect
on a choice probability. Here we present results used in the Section 15.2 application.


Marginal Effects and Elasticities 


We focus on marginal effects on the choice probabilities of a change in the regressor 
for a given individual. Elasticities can then be computed by multiplying the marginal 
effect by the current regressor value and dividing by the probability. Typically these are 
then averaged over individuals to give an average marginal effect or average elasticity. 

For the CL model consider the effect on the \(j\)th probability of changing by one
unit the value of the regressor for the \(k\)th alternative. For example, what is the effect
on the probabilities of choosing various modes of transportation if travel time by bus
increases by a minute whereas the travel time by other modes is unchanged? From
Section 15.12.1

\[
\frac{\partial p_{ij}}{\partial x_{ik}} = p_{ij} (\delta_{ijk} - p_{ik}) \beta, \tag{15.18}
\]

where \(\delta_{ijk}\) was defined after (15.15). It follows that if the regression coefficient is
positive then an increase in the corresponding component of the regressor value for
the \(k\)th alternative increases the probability of the \(k\)th alternative and decreases the
probability of the other alternatives.

For the MNL model consider instead the effect on the \(j\)th probability of changing
by one unit a regressor that takes the same value across all alternatives. For example,
what is the effect on the probabilities of choosing to work if age increases by one year?
From Section 15.12.2

\[
\frac{\partial p_{ij}}{\partial x_i} = p_{ij} (\beta_j - \bar\beta_i), \tag{15.19}
\]

where \(\bar\beta_i = \sum_{l=1}^{m} p_{il} \beta_l\) is a probability weighted average of the \(\beta_l\). It follows that the
sign of the response is not necessarily given by the sign of \(\beta_j\), unless \(\beta_j > \beta_k\) for
all \(k \neq j\), and it does not necessarily make any sense to test whether a particular
coefficient is zero. As in other nonlinear models we may compute the average response
\(N^{-1} \sum_{i} \partial p_{ij}/\partial x_i = N^{-1} \sum_{i} p_{ij} (\beta_j - \bar\beta_i)\), or we can use noncalculus methods and
compare the change in the average predicted probability as regressors change.
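
The following sketch evaluates the MNL marginal effects in (15.19) for a single regressor and three alternatives. The coefficient values, regressor values, and function names are assumptions chosen for illustration. Alternative 2 has a positive coefficient, yet its probability falls as the regressor rises because \(\beta_2\) is below the probability weighted average \(\bar\beta_i\). The analytical effects are also compared with central finite differences.

```python
import numpy as np

def mnl_probs(x, B):
    V = x @ B.T
    V = V - V.max(axis=1, keepdims=True)
    P = np.exp(V)
    return P / P.sum(axis=1, keepdims=True)

def mnl_marginal_effects(x, B):
    """(N, m, K) array of dp_ij/dx_i = p_ij (beta_j - betabar_i), as in (15.19)."""
    P = mnl_probs(x, B)                   # (N, m)
    beta_bar = P @ B                      # (N, K) probability weighted averages
    return P[:, :, None] * (B[None, :, :] - beta_bar[:, None, :])

# Assumed values: one regressor, three alternatives, beta_1 = 0.
B = np.array([[0.0], [0.5], [2.0]])
x = np.array([[0.5], [1.0], [2.0]])

me = mnl_marginal_effects(x, B)
ame = me.mean(axis=0)                     # average marginal effects, shape (m, K)

# Central finite differences; with K = 1, x + h perturbs the only regressor.
h = 1e-6
fd = (mnl_probs(x + h, B) - mnl_probs(x - h, B)) / (2 * h)
print(ame.ravel())
```

The effects sum to zero across alternatives for each individual, since the probabilities always sum to one.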


Comparison to Base Category 


The coefficients in the CL and MNL models can also be given a more direct logit-like 
interpretation in terms of relative risk (detailed in Section 14.3.4). This is because the 
models can be reexpressed as binary logit models. 

For the MNL model, comparison is to a base category, which is the alternative
normalized to have coefficients equal to zero. To see this note that the multinomial logit
probabilities (15.11) imply that the conditional probability of observing alternative \(j\)
given that either alternative \(j\) or alternative \(k\) is observed is

\[
\Pr[y_i = j \mid y_i = j \text{ or } k] = \frac{p_{ij}}{p_{ij} + p_{ik}} = \frac{\exp(x_i'(\beta_j - \beta_k))}{1 + \exp(x_i'(\beta_j - \beta_k))}, \tag{15.20}
\]

which is a logit model with coefficient \((\beta_j - \beta_k)\). The second equality comes after
some simplification. Suppose normalization is on alternative 1, so that \(\beta_1 = 0\).
Then

\[
\Pr[y_i = j \mid y_i = j \text{ or } 1] = \frac{\exp(x_i'\beta_j)}{1 + \exp(x_i'\beta_j)},
\]

and \(\beta_j\) can be interpreted in the same way as the logit model coefficient for binary
choice between alternatives \(j\) and 1. Similarly to the binary logit model the relative
risk of choosing alternative \(j\) rather than alternative 1 is

\[
\frac{\Pr[y_i = j]}{\Pr[y_i = 1]} = \exp(x_i'\beta_j),
\]

and hence \(e^{\beta_{jr}}\) gives the proportionate change in this relative risk when \(x_{ir}\) changes by
one unit. Such interpretations will vary according to which alternative is normalized to
have zero coefficient, and for this interpretation to be really useful one needs to have
a natural base category. For example, if interest lies in various alternative commute
modes to traveling by car then normalize the coefficients for the car alternative to equal
zero.

A similar approach can also be applied to the CL model, with

\[
\Pr[y_i = j \mid y_i = j \text{ or } k] = \frac{\exp((x_{ij} - x_{ik})'\beta)}{1 + \exp((x_{ij} - x_{ik})'\beta)}, \tag{15.21}
\]

and normalization now is with respect to regressor values for a base category.


15.4.4. Independence of Irrelevant Alternatives 


A limitation of the CL and MNL models is that discrimination among the m alterna- 
tives reduces to a series of pairwise comparisons that are unaffected by the character- 
istics of alternatives other than the pair under consideration. This is clear from (15.20) 
and (15.21), which show that the MNL model reduces to a binary choice logit model 
between any pair of choices. The conditional probability does not depend on other 
alternatives. 

As an extreme example, the conditional probability of commute by car given com- 
mute by car or red bus is assumed in an MNL or CL model to be independent of 
whether commuting by blue bus is an option. However, in practice we would expect 
introduction of a blue bus, which is the same as a red bus in every aspect except color, 
to have little impact on car use and to halve use of the red bus, leading to an increase 
in the conditional probability of car use given commute by car or red bus. 
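
A small numerical sketch makes the point. The utilities below are assumed values: car and red bus are equally attractive, and the blue bus is given the same deterministic utility as the red bus. Under logit probabilities the conditional probability of car given car or red bus is the same with and without the blue bus, so the car share falls from one-half to one-third rather than staying at one-half.

```python
import numpy as np

def logit_probs(V):
    """Logit choice probabilities exp(V_j) / sum_l exp(V_l)."""
    e = np.exp(V - np.max(V))
    return e / e.sum()

V_car, V_bus = 0.0, 0.0                              # assumed deterministic utilities
p2 = logit_probs(np.array([V_car, V_bus]))           # car, red bus
p3 = logit_probs(np.array([V_car, V_bus, V_bus]))    # car, red bus, blue bus

cond2 = p2[0] / (p2[0] + p2[1])    # Pr[car | car or red bus] with two choices
cond3 = p3[0] / (p3[0] + p3[1])    # same conditional probability with three choices
print(p2, p3, cond2, cond3)
```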

This weakness of MNL is known in the literature as the red bus—blue bus prob- 
lem, or more formally as the assumption of independence of irrelevant alternatives. 
It can be tested by a Hausman test (see Hausman and McFadden, 1984). For example,
we could compare the coefficient estimates of red bus in a three-choice model of
car, red bus, and blue bus, with car the base category, with the coefficient estimates
of red bus in a binary choice model of car and red bus, again with car the base
category.

Much of the econometrics literature has focused on alternative unordered models 
that do not have this weakness. These models are presented in Sections 15.6-15.8. 


15.5. Additive Random Utility Models


Unordered multinomial models more general than multinomial and conditional logit 
can be obtained using the general framework of additive random utility models, pre- 
sented in this section. Subsequent sections describe the leading examples. 


15.5.1. ARUM 


The additive random utility model was introduced in Section 14.4.2 for binary
outcomes. In the general \(m\)-choice multinomial model the utility of the \(j\)th choice is
specified to be given by

\[
U_j = V_j + \varepsilon_j, \quad j = 1, 2, \ldots, m, \tag{15.22}
\]

where \(V_j\) denotes the deterministic component of utility and \(\varepsilon_j\) denotes the random
component of utility. For the \(i\)th individual usually \(V_{ij} = x_{ij}'\beta\) or \(V_{ij} = x_i'\beta_j\), though
more structural analysis may specify direct or indirect utility functions used in
consumer demand theory. For notational simplicity we suppress the individual subscript \(i\)
in the following.

The chosen alternative is that with the highest utility, so that

\[
\begin{aligned}
\Pr[y = j] &= \Pr[U_j \geq U_k, \text{ all } k \neq j] \\
&= \Pr[U_k - U_j \leq 0, \text{ all } k \neq j] \\
&= \Pr[\varepsilon_k - \varepsilon_j \leq V_j - V_k, \text{ all } k \neq j] \\
&= \Pr[\tilde\varepsilon_{kj} \leq -\tilde{V}_{kj}, \text{ all } k \neq j],
\end{aligned} \tag{15.23}
\]

where the tilde and second subscript \(j\) denote differencing with respect to reference
alternative \(j\), that is, \(\tilde\varepsilon_{kj} = \varepsilon_k - \varepsilon_j\) and \(\tilde{V}_{kj} = V_k - V_j\).

Different multinomial models can be generated by different assumptions about the 
joint distribution of the error terms. These models are valid statistically, with proba- 
bilities summing to one. Additionally, they are consistent with the standard economic 
theory of decision making. 

For example, consider the expression for \(\Pr[y = 1]\) in a three-choice model. Using
the last equality in (15.23) and defining \(\tilde\varepsilon_{31} = \varepsilon_3 - \varepsilon_1\) and \(\tilde\varepsilon_{21} = \varepsilon_2 - \varepsilon_1\) we have

\[
\begin{aligned}
\Pr[y = 1] &= \Pr[\tilde\varepsilon_{21} \leq -\tilde{V}_{21}, \; \tilde\varepsilon_{31} \leq -\tilde{V}_{31}] \\
&= \int_{-\infty}^{-\tilde{V}_{31}} \int_{-\infty}^{-\tilde{V}_{21}} f(\tilde\varepsilon_{21}, \tilde\varepsilon_{31}) \, d\tilde\varepsilon_{21} \, d\tilde\varepsilon_{31},
\end{aligned} \tag{15.24}
\]

which is a bivariate integral that generally does not have an analytical solution. More
generally, an \(m\)-choice model involves an \((m - 1)\)-variate integral that may or may not
yield a closed-form solution for \(\Pr[y = j]\).

In general all the errors \(\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_m\) may be correlated across choices. Some
covariance restrictions are necessary, however, as the model is identified only up to the
\((m - 1)\) error-difference pairs (see the last equality in (15.23)), and additionally one
variance needs to be specified since the \(U_j\) are only determined up to scale.


15.5.2. Different Unordered Multinomial Models


Different unordered multinomial models arise from different assumptions on the joint
distribution of \(\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_m\). Analysis is simplest if the error assumptions lead to a
closed-form solution for the choice probabilities. However, in many applications these
assumptions are felt to be too restrictive.

The computationally intensive methods summarized in Chapter 12 permit estimation
even if there is no closed-form solution for the choice probabilities. Sections 15.7.2 and
15.8.2 present multinomial examples of these methods.


Type 1 Extreme Value Errors 


We first assume that the errors \(\varepsilon_j\) are iid type 1 extreme value, with density

\[
f(\varepsilon_j) = e^{-\varepsilon_j} \exp(-e^{-\varepsilon_j}), \quad j = 1, \ldots, m. \tag{15.25}
\]

The properties of this density were given in Section 14.4.2, where it was shown to lead
to a logit model in the binary outcome case.

For multinomial outcomes modelled using the ARUM with type 1 extreme value
errors it can be shown that (15.23) yields

\[
\Pr[y = j] = \frac{e^{V_j}}{e^{V_1} + e^{V_2} + \cdots + e^{V_m}}. \tag{15.26}
\]

This is the CL model when \(V_j = x_j'\beta\) and the MNL model when \(V_j = x'\beta_j\). The result
can be obtained either by integration and simplification similar to the binary case (see
Section 14.8), or as a special case of the nested logit result derived in Section 15.6.
Thus conditional and multinomial logit models can be obtained from an ARUM.
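
The link between the ARUM and (15.26) can be checked by simulation. The sketch below is illustrative only: the utilities in `V` and the number of draws are assumed values. It draws iid type 1 extreme value errors with NumPy's `Generator.gumbel`, picks the alternative with the highest utility in each draw, and compares the simulated choice frequencies with the logit probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)
V = np.array([0.2, -0.4, 1.0])      # assumed deterministic utilities V_j
S = 200_000                         # number of simulated individuals

# iid type 1 extreme value errors with density exp(-e) * exp(-exp(-e)), as in (15.25).
eps = rng.gumbel(loc=0.0, scale=1.0, size=(S, V.size))
choice = np.argmax(V + eps, axis=1)             # utility-maximizing alternative
freq = np.bincount(choice, minlength=V.size) / S

logit = np.exp(V) / np.exp(V).sum()             # closed form (15.26)
print(freq, logit)
```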

The assumption that the errors \(\varepsilon_j\) are independent across alternatives \(j\) is too
restrictive as it is likely to be violated if two alternatives are similar. For example, suppose
alternatives 1 and 2 are similar. A low value of \(\varepsilon_1\) (i.e., large and negative) leads to
overprediction of the utility of alternative 1. We then also expect to overpredict the
utility of alternative 2, so that \(\varepsilon_2\) also takes a low value. Since low values of \(\varepsilon_1\) and \(\varepsilon_2\)
tend to go together, and similarly for high values, the errors must be correlated. This
is another way of viewing the “red bus–blue bus” problem, and it is a manifestation of
a failure of the logit assumption of independence of irrelevant alternatives.

The generalized extreme value model and the nested logit model (see Section 15.6) 
relax the assumption that the extreme value errors are independent across choices. The 
errors are grouped with independence across groups but correlation permitted within 
groups. Closed-form solutions are then available for the choice probabilities. Although 
these models are richer than the MNL model, the special case of no correlation within 
groups, in many applications the grouping of errors can be somewhat arbitrary. 

The random parameters logit model (see Section 15.7) introduces additional ran- 
domness into the MNL model that induces correlation of utilities across alternatives. 
This is an example of a generalized random utility model (see Section 15.7.3). 


Normally Distributed Errors


The multinomial probit model (see Section 15.8) arises if the errors \(\varepsilon_1, \ldots, \varepsilon_m\) are
assumed to be jointly normally distributed. This error assumption is a more natural
starting point than one of type 1 extreme value. It permits a very rich correlation structure, at
the expense of the need to use numerical or simulation methods that accommodate an
\((m - 1)\)-variate normal integral.


15.5.3. Consistency with Random Utility Models 


It is always possible to present an analytical expression for choice probabilities that lie 
between zero and one and that sum over alternatives to one. A quite general example 
is the universal logit model in (15.13). The econometrics literature has placed great 
emphasis in restricting attention to multinomial models that are consistent with maxi- 
mization of a random utility function. This is similar to restricting analysis to demand 
functions that are consistent with consumer choice theory. 

Let \(V = (V_1, \ldots, V_m)\). From Börsch-Supan (1987, p. 19), a set of choice
probabilities \(p_j(V)\), \(j = 1, \ldots, m\), is compatible with maximization of an ARUM if


1. \(p_j(V) \geq 0\), \(\sum_{j=1}^{m} p_j(V) = 1\), and \(p_j(V) = p_j(V + \alpha)\) for all \(\alpha \in \mathbb{R}\);
2. \(\partial p_j(V)/\partial V_k = \partial p_k(V)/\partial V_j\); and
3. \(\partial^{(m-1)} p_j(V)/\partial V_1 \cdots [\partial V_j] \cdots \partial V_m \geq 0\), where the square bracket denotes a term to be
dropped out.


These conditions, due to Williams (1977), Daly and Zachary (1979), and McFadden
(1981), ensure in turn (1) well-behaved probabilities and translation invariance;
(2) integrability of \(p_j\) similar to the Slutsky condition; and (3) that the distribution
function of the errors in the corresponding ARUM has a proper (nonnegative) density
function.
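
Conditions 1 and 2 are easy to check numerically for the logit probabilities in (15.26). The sketch below uses assumed utilities and the standard logit derivative \(\partial p_j/\partial V_k = p_j(\delta_{jk} - p_k)\), which is symmetric in \(j\) and \(k\); it also confirms that derivative by finite differences. It does not attempt to check condition 3.

```python
import numpy as np

def logit_p(V):
    """Logit choice probabilities as a function of the utility vector V."""
    e = np.exp(V - V.max())
    return e / e.sum()

V = np.array([0.3, -1.2, 0.8, 0.1])       # assumed utilities, m = 4
probs = logit_p(V)

# Condition 1: nonnegative, sum to one, unchanged by adding a constant to all V_j.
shifted = logit_p(V + 2.5)

# Condition 2: the matrix of derivatives dp_j/dV_k is symmetric.
jac = np.diag(probs) - np.outer(probs, probs)

# Finite-difference check of the analytical derivatives.
h = 1e-6
jac_fd = np.empty((V.size, V.size))
for k in range(V.size):
    step = np.zeros(V.size)
    step[k] = h
    jac_fd[:, k] = (logit_p(V + step) - logit_p(V - step)) / (2 * h)

print(np.allclose(shifted, probs), np.allclose(jac, jac.T))
```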


15.5.4. Welfare Analysis 


A major advantage of using a multinomial model that is a random utility model is that 
it permits welfare analysis. Then one can place a dollar value on the effect of changing 
one or more of the determinants of choice, such as price or time cost of travel in 
transportation mode choice. 

Standard welfare analysis uses compensating variation or equivalent variation.
The deterministic component of utility in (15.22) is specified as the indirect utility
function

\[
V_j = V(I - p_j, x_j), \tag{15.27}
\]

where \(I\) denotes income, \(p_j\) is the price of the \(j\)th alternative, and \(x_j\) are characteristics
associated with the \(j\)th alternative. For notational simplicity the unknown regression
parameters \(\beta\) are suppressed. Then utility of alternative \(j\) is

\[
U_j = U(I - p_j, x_j, \varepsilon_j) = V(I - p_j, x_j) + \varepsilon_j. \tag{15.28}
\]

Suppose we change the characteristics from \(x_j^0\) to \(x_j^1\). Then compensating variation
\(CV\) is the change in income needed to hold utility at its initial level, so that the highest
utility level attainable with income \(I\) and characteristics \(x^0\) must equal the highest level
attainable with income \((I - CV)\) and characteristics \(x^1\). Thus compensating variation
\(CV\) is implicitly defined as the solution to

\[
\max_{j=1,\ldots,m} U(I - p_j, x_j^0, \varepsilon_j) = \max_{j=1,\ldots,m} U(I - CV - p_j, x_j^1, \varepsilon_j). \tag{15.29}
\]

As an example, consider a two-choice model where \(U_j = I + x_j + \varepsilon_j\), \(j = 1, 2\),
and the scalar \(x_j\) changes from \(x_j^0\) to \(x_j^1\). Then there are four possibilities. If
alternative 1 is chosen before and after then \(CV = (x_1^1 - x_1^0)\), since then
\(U_1^1 = I - CV + x_1^1 + \varepsilon_1 = I + x_1^0 + \varepsilon_1 = U_1^0\). Similarly, if alternative 2 is chosen before and after then
\(CV = (x_2^1 - x_2^0)\). If switching occurs from alternative 1 to alternative 2 then \(U_2^1 = U_1^0\)
implies \(I - CV + x_2^1 + \varepsilon_2 = I + x_1^0 + \varepsilon_1\), which implies \(CV = x_2^1 - x_1^0 + \varepsilon_2 - \varepsilon_1\).
Similarly, if switching occurs from alternative 2 to alternative 1 then
\(CV = x_1^1 - x_2^0 + \varepsilon_1 - \varepsilon_2\). More generally, for \(m\) choices the compensating variation in this simple
example is \(CV_{jk} = V_k^1 - V_j^0 + \varepsilon_k - \varepsilon_j\) if the change in \(x\) leads to a change from alternative
\(j\) to alternative \(k\).

The compensating variation depends on observables ($I$, $p_j$, and $\mathbf{x}_j$), parameters
that can be estimated, and on unobservable errors $\varepsilon_j$. The unobservables are
eliminated by computing the expected compensating variation $E[CV]$, which involves integrating
over $\varepsilon_j$. From the preceding example it should be clear that this integration can be
quite difficult. Dagsvik and Karlström (2004) provide quite general results, discussed
further in Section 15.6.5.

For some models there is no analytical solution for $E[CV]$. Then one instead needs
to numerically integrate over $\varepsilon_j$ the function for CV defined in (15.29). From
Section 12.3.2 this integral can be simulated in the following way:


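1. At iteration $s$ draw $\boldsymbol{\varepsilon}^s$ from the distribution of $\boldsymbol{\varepsilon} = (\varepsilon_1, \ldots, \varepsilon_m)$.
2. Calculate $CV^s$ from $\max_{j=1,\ldots,m} U(I - p_j, \mathbf{x}_j^0, \varepsilon_j^s) = \max_{j=1,\ldots,m} U(I - CV^s - p_j, \mathbf{x}_j^1, \varepsilon_j^s)$.
3. Repeat steps 1 and 2 $S$ times.
4. Estimate $E[CV]$ by $S^{-1} \sum_{s=1}^{S} CV^s$.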


This method yields $E[CV]$ for each individual in the sample. Averaging, possibly
with weighting, provides a population estimate. Application to the GEV model is
discussed in Section 15.6.5.
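The four simulation steps can be sketched for the additive two-choice example given earlier, extended to $m$ alternatives, where step 2 has the closed form $CV^s = \max_j (x_j^1 + \varepsilon_j^s) - \max_j (x_j^0 + \varepsilon_j^s)$. This is only an illustrative sketch: it assumes iid type I extreme value errors, and the function names and the values of $x_j^0$ and $x_j^1$ are hypothetical. Under that assumption the simulated value can be checked against the difference in log-sums discussed in Section 15.6.5.

```python
import math
import random


def draw_extreme_value(rng):
    """Type I extreme value draw by inverting the cdf exp(-exp(-e))."""
    u = rng.random()
    while u == 0.0:  # guard against log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def simulate_expected_cv(x0, x1, n_draws, rng):
    """Steps 1-4 for U_j = I + x_j + e_j (income enters additively)."""
    total = 0.0
    for _ in range(n_draws):
        # Step 1: draw the error vector once; it is held fixed before and after.
        eps = [draw_extreme_value(rng) for _ in x0]
        # Step 2: CV^s equates maximum utility before and after the change.
        best_before = max(x + e for x, e in zip(x0, eps))
        best_after = max(x + e for x, e in zip(x1, eps))
        total += best_after - best_before
    # Steps 3-4: average over the S = n_draws replications.
    return total / n_draws


rng = random.Random(12345)          # fixed seed, for reproducibility only
x_before = [1.0, 0.5, 0.0]          # hypothetical x_j^0
x_after = [1.0, 1.0, 0.0]           # hypothetical x_j^1 (alternative 2 improves)
ecv_sim = simulate_expected_cv(x_before, x_after, 50_000, rng)

# Closed-form check, valid only for iid extreme value errors (income coefficient 1).
ecv_logsum = (math.log(sum(math.exp(x) for x in x_after))
              - math.log(sum(math.exp(x) for x in x_before)))
print(round(ecv_sim, 3), round(ecv_logsum, 3))
```

When income enters utility nonlinearly, step 2 no longer has a closed form and $CV^s$ would have to be found by a one-dimensional root search at each draw.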


15.6. Nested Logit 


The nested logit is the most analytically tractable generalization of the multinomial
models. It is the ideal model to use when there is a clear nesting structure, but not all
multinomial choice applications have an obvious nesting structure.


15.6.1. Generalized Extreme Value Model


McFadden (1978) proposed a quite general class of model based on the assumption that 
the joint distribution of the errors is the generalized extreme value (GEV) distribution 
with joint distribution function 


$$F(\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_m) = \exp[-G(e^{-\varepsilon_1}, e^{-\varepsilon_2}, \ldots, e^{-\varepsilon_m})], \tag{15.30}$$


where the function $G(Y_1, Y_2, \ldots, Y_m)$ is specified to satisfy a number of assumptions
including nonnegativity, homogeneity of degree one, mixed partial derivatives that are
continuous and nonpositive for even order and nonnegative for odd order, and
$\lim_{Y_j \to \infty} G(Y_1, Y_2, \ldots, Y_m) = \infty$. These assumptions ensure that the joint
distribution and resulting marginal distributions are well defined and that probabilities sum
to one.

If the errors are GEV distributed then an explicit solution for the probabilities in the 
random utility model (15.22) can be obtained, with 


$$p_j = \Pr[y = j] = e^{V_j}\, \frac{G_j(e^{V_1}, e^{V_2}, \ldots, e^{V_m})}{G(e^{V_1}, e^{V_2}, \ldots, e^{V_m})}, \tag{15.31}$$


where $G_j(Y_1, Y_2, \ldots, Y_m) = \partial G(Y_1, Y_2, \ldots, Y_m)/\partial Y_j$ (see McFadden, 1978, p. 81).

A wide range of models can be obtained by different choices of $G(Y_1, Y_2, \ldots, Y_m)$.
The MNL model is obtained if $G(Y_1, Y_2, \ldots, Y_m) = \sum_{k=1}^{m} Y_k$; hence the MNL model
is a GEV model. The other widely used GEV model is the nested logit model.
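The MNL special case can be verified in one line. The sketch below reads (15.31) as $p_j = e^{V_j} G_j(\cdot)/G(\cdot)$ with both functions evaluated at $(e^{V_1}, \ldots, e^{V_m})$; that reading is an assumption about the intended form of the formula.

```latex
G(Y_1,\ldots,Y_m) = \sum_{k=1}^{m} Y_k
\quad\Longrightarrow\quad
G_j(Y_1,\ldots,Y_m) = \frac{\partial G}{\partial Y_j} = 1 ,
\qquad
p_j = e^{V_j}\,\frac{1}{\sum_{k=1}^{m} e^{V_k}}
    = \frac{e^{V_j}}{\sum_{k=1}^{m} e^{V_k}} .
```

This is the familiar multinomial logit probability, and this $G$ is linear and so homogeneous of degree one, as the GEV assumptions require.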


15.6.2. Nested Logit Model 


The nested logit model breaks decision making into groups. A simple example is to 
consider choice of college, where people first decide whether to go to a two-year or 
four-year college, and then within each of these paths whether to go to a public or 
private college. The situation is depicted as follows: 


                 College
                /       \
          2 year         4 year
          /    \         /    \
    Private  Public  Private  Public


The errors in a random utility model are permitted to be correlated for each option 
within the two-year and four-year groups, but they are uncorrelated across these two 
groups. 

More generally, we suppose that at the top level there are $J$ limbs to choose from.
The $j$th limb has $K_j$ branches numbered $j1, \ldots, jk, \ldots, jK_j$. The utility for the
alternative in the $j$th of $J$ limbs and $k$th of $K_j$ branches is then

$$U_{jk} = V_{jk} + \varepsilon_{jk}, \quad k = 1, 2, \ldots, K_j, \quad j = 1, 2, \ldots, J, \tag{15.32}$$

where for an $m$-choice model $K_1 + \cdots + K_J = m$. This is illustrated as follows:


                                  Root
              /                    |                    \
          limb 1       ...       limb j       ...       limb J
         /      \                  |                   /      \
  branch 1 ... branch K_1  ...  branch k  ...  branch 1 ... branch K_J
  V_11 + ε_11 ... V_1K_1 + ε_1K_1 ... V_jk + ε_jk ... V_J1 + ε_J1 ... V_JK_J + ε_JK_J


There can be additional levels, with the third level being a twig, etc. For notational 
simplicity we present results for a two-level model. 

For any model with this nesting, $p_{jk}$, the joint probability of being on limb $j$ and
branch $k$, can be factored as $p_j$, the probability of choosing limb $j$, times $p_{k|j}$, the
probability of choosing branch $k$ conditional on being on limb $j$. Thus

$$p_{jk} = p_j \times p_{k|j}.$$


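The nested logit model of McFadden (1978) arises when the error terms $\varepsilon_{jk}$ have
the GEV joint cumulative distribution function

$$F(\boldsymbol{\varepsilon}) = \exp[-G(e^{-\varepsilon_{11}}, \ldots, e^{-\varepsilon_{1K_1}}, \ldots, e^{-\varepsilon_{J1}}, \ldots, e^{-\varepsilon_{JK_J}})] \tag{15.33}$$

for the following particular specification of the function $G(\cdot)$:

$$G(\mathbf{Y}) = G(Y_{11}, \ldots, Y_{1K_1}, \ldots, Y_{J1}, \ldots, Y_{JK_J}) = \sum_{j=1}^{J} \left( \sum_{k=1}^{K_j} Y_{jk}^{1/\rho_j} \right)^{\rho_j}. \tag{15.34}$$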


The parameter $\rho_j$ is a function of the correlation between $\varepsilon_{jk}$ and
$\varepsilon_{jl}$ but does not exactly equal the correlation parameter. In fact $\rho_j$ can be
shown to equal $\sqrt{1 - \mathrm{Cor}[\varepsilon_{jk}, \varepsilon_{jl}]}$, so $\rho_j$ is inversely
related to the correlation and we expect $0 < \rho_j < 1$. The choice $\rho_j = 1$ corresponds to
independence of $\varepsilon_{jk}$ and $\varepsilon_{jl}$ and leads to the MNL model. We call the
parameters $\rho_j$ the scale parameters, as they scale regression parameters in the models
considered in the following.

Notation varies considerably across authors. McFadden (1978) and Maddala (1983)
instead define this cdf in terms of $\sigma_j = 1 - \rho_j$, called the dissimilarity parameter.
Others use $\mu_j = 1/\rho_j$. Many authors model alternative $ij$ for the $n$th individual
whereas we model alternative $jk$ and reserve $i$ for the $i$th individual.

The outcome indicator variables $y_{jk}$ equal one if alternative $jk$ is chosen and
zero otherwise. Then from (15.32), $p_{jk} = \Pr[y_{jk} = 1] = \Pr[U_{jk} > U_{lm}, \text{ for all } l, m]$.
Closed-form solutions for the probabilities $p_{jk}$, as a function of the $V_{jk}$ and $\rho_j$, are
derived in Section 15.12.3. These are then evaluated for the particular deterministic
utility function

$$V_{jk} = \mathbf{z}_j'\boldsymbol{\alpha} + \mathbf{x}_{jk}'\boldsymbol{\beta}_j, \quad k = 1, \ldots, K_j, \quad j = 1, \ldots, J, \tag{15.35}$$

where $\mathbf{z}_j$ varies over limbs only and $\mathbf{x}_{jk}$ varies over both limbs and branches.
The parameters $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}_j$ are called regression parameters.


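The GEV model (15.32)-(15.35) yields the nested logit model

$$p_{jk} = p_j \times p_{k|j} = \frac{\exp(\mathbf{z}_j'\boldsymbol{\alpha} + \rho_j I_j)}{\sum_{m=1}^{J} \exp(\mathbf{z}_m'\boldsymbol{\alpha} + \rho_m I_m)} \times \frac{\exp(\mathbf{x}_{jk}'\boldsymbol{\beta}_j/\rho_j)}{\sum_{l=1}^{K_j} \exp(\mathbf{x}_{jl}'\boldsymbol{\beta}_j/\rho_j)}; \tag{15.36}$$

see Section 15.12.3, where

$$I_j = \ln\left( \sum_{l=1}^{K_j} \exp(\mathbf{x}_{jl}'\boldsymbol{\beta}_j/\rho_j) \right) \tag{15.37}$$

is called the inclusive value or the log-sum. One attraction of the nested logit model
is that the probabilities $p_j$ and $p_{k|j}$ are essentially of conditional logit form.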

The preceding results are for regressors that vary across alternatives. The algebra
can be adapted to alternative-invariant regressors, $V_{jk} = \mathbf{z}'\boldsymbol{\alpha}_j + \mathbf{x}'\boldsymbol{\beta}_{jk}$,
with a normalization of one of the $\boldsymbol{\beta}_{jk}$. Algebraically all that is needed is a
partition $V_{jk} = A_j + B_{jk}$, where $A_j$ pertains to the limb and $B_{jk}$ pertains to both limb
and branch.
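The nested logit probabilities can be computed directly from the limb and branch indices. The sketch below takes $\mathbf{z}_j'\boldsymbol{\alpha}$ and $\mathbf{x}_{jk}'\boldsymbol{\beta}_j$ as already evaluated numbers; the function name and the numerical values are hypothetical and serve only to show that the probabilities sum to one and collapse to conditional logit when every $\rho_j = 1$.

```python
import math


def nested_logit_probs(limb_index, branch_index, rho):
    """Return p[j][k] = p_j * p_{k|j} for a two-level nested logit.

    limb_index[j]      : value of z_j' alpha for limb j
    branch_index[j][k] : value of x_jk' beta_j for branch k of limb j
    rho[j]             : scale parameter rho_j for limb j
    """
    # Inclusive value (log-sum) for each limb.
    inclusive = [
        math.log(sum(math.exp(b / r) for b in branches))
        for branches, r in zip(branch_index, rho)
    ]
    # Numerators of the limb probabilities: exp(z_j' alpha + rho_j * I_j).
    limb_num = [
        math.exp(a + r * iv) for a, r, iv in zip(limb_index, rho, inclusive)
    ]
    limb_den = sum(limb_num)

    probs = []
    for j, branches in enumerate(branch_index):
        p_limb = limb_num[j] / limb_den
        branch_den = math.exp(inclusive[j])
        probs.append(
            [p_limb * math.exp(b / rho[j]) / branch_den for b in branches]
        )
    return probs


# Hypothetical indices for J = 2 limbs with two branches each.
limb = [0.2, 0.0]
branch = [[0.5, -0.1], [0.3, 0.0]]

p_nested = nested_logit_probs(limb, branch, rho=[0.6, 0.8])
p_cl = nested_logit_probs(limb, branch, rho=[1.0, 1.0])  # conditional logit case
print(sum(sum(row) for row in p_nested))
```

With `rho=[1.0, 1.0]` each probability equals $e^{V_{jk}}/\sum_{l,m} e^{V_{lm}}$ with $V_{jk}$ the sum of the limb and branch indices.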


15.6.3. Estimation of Nested Logit 


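For the $i$th observation we observe $K_1 + \cdots + K_J$ outcomes $y_{ijk}$, where $y_{ijk} = 1$ if
alternative $jk$ is chosen and is zero otherwise. Then $p_{ijk} = p_{ij} \times p_{ik|j}$ and the
density for one observation $\mathbf{y}_i = (y_{i11}, \ldots, y_{iJK_J})$ can be compactly expressed as

$$f(\mathbf{y}_i) = \prod_{j=1}^{J} \prod_{k=1}^{K_j} [p_{ij} \times p_{ik|j}]^{y_{ijk}} = \prod_{j=1}^{J} \left( p_{ij}^{y_{ij}} \prod_{k=1}^{K_j} p_{ik|j}^{y_{ijk}} \right),$$

where $y_{ij} = \sum_{l=1}^{K_j} y_{ijl}$ equals one if limb $j$ is chosen and equals zero otherwise.
The density for the sample is $\prod_{i=1}^{N} f(\mathbf{y}_i)$. The FIML estimator maximizes

$$\ln L = \sum_{i=1}^{N} \sum_{j=1}^{J} y_{ij} \ln p_{ij} + \sum_{i=1}^{N} \sum_{j=1}^{J} \sum_{k=1}^{K_j} y_{ijk} \ln p_{ik|j}, \tag{15.38}$$

with respect to parameters $\boldsymbol{\alpha}$, $\boldsymbol{\beta}_j$, and $\rho_j$.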

An alternative, less-efficient estimator is the sequential estimator or LIML estimator
that exploits the partitioning of $p_{jk}$ into the product of $p_{k|j}$ and $p_j$. The first
stage bases estimation on the second term of the right-hand side of (15.38), which
from (15.36) is a conditional logit model with estimated parameter $\boldsymbol{\beta}_j/\rho_j$. The
second stage bases estimation on the first term of the right-hand side, which from (15.36) is a
conditional logit model with added regressor $\hat{I}_j$, an estimate of the inclusive value in
(15.37) that can be computed using the first-stage parameter estimates. The $\hat{\boldsymbol{\alpha}}$
and $\hat{\rho}_j$ are obtained directly from the second stage, whereas $\hat{\boldsymbol{\beta}}_j$ equals
$\hat{\rho}_j$ times the first-stage estimate of $\boldsymbol{\beta}_j/\rho_j$.

This sequential estimator is less efficient than the FIML estimator, and at the second
stage the usual CL standard errors understate the true standard errors of the sequential
estimator since they do not allow for the estimation error in computing the inclusive
value. McFadden (1981) gives the formula for correct standard errors, or the bootstrap
can be used. The sequential alternative estimator was originally proposed at a time
when even conditional logit model estimation was challenging. Now it is relatively
simple to code the likelihood function, so it is best to use FIML. Sequential estimation
is potentially useful to provide starting values as the FIML log-likelihood is not
globally concave.

As an example we applied the nested logit model to the data of Section 15.2. The
nesting structure was shore or boat fishing at the higher level, with lower levels beach
or pier (for shore fishing) and private or charter (for boat fishing). The regressors
$\mathbf{x}_{jk}$ in (15.36) that vary at the lower level were price ($P$) and catch rate ($C$). The
regressors $\mathbf{z}_j$ at the higher level that vary across shore or boat were an indicator
variable $d$ equal to one if shore fishing and $d \times I$, income interacted with the shore
fishing indicator. Estimation by conditional logit (corresponding to $\rho_1 = \rho_2 = 1$) yielded a
fitted model with $\ln L = -1252$, as expected smaller than the log-likelihood for the similar
but less restricted model given in the last column of Table 15.2. FIML estimation of the
corresponding nested logit model, with $\rho_1$ and $\rho_2$ now free to vary, led to a much higher
log-likelihood and rejection of the more restricted conditional logit model using
the $\chi^2(2)$ likelihood ratio test statistic.


15.6.4. Discussion 


The main limitation of the nested logit model is that not all choice problems lend
themselves to an obvious nesting structure. One can still select the optimal nesting scheme
using likelihood ratio tests, where appropriate, or Akaike's information criterion. However,
the resulting scheme does not always accord with a priori expectations.

Another practical issue is that consistency of the nested logit model with choice
from an ARUM requires that the three conditions in Section 15.5.2 are satisfied. The
third of these conditions is satisfied globally if $0 < \rho_j < 1$, and with more than two
levels of nesting it is additionally required that $\rho$ at higher levels of the nest structure
does not exceed $\rho$ at lower levels of nesting. In practice it is possible to obtain estimates
of $\rho_j$ outside the unit interval. One can still use the model, as the choice probabilities
are proper, but the model may no longer come from an ARUM. Börsch-Supan and
others have considered local identification conditions under which the nested logit
model may be consistent with ARUM even if $\rho_j$ lies outside the unit interval. It can
also be useful to do a grid search over $\rho_j$ to constrain $\rho_j$ to the unit interval and to
enumerate the reduction in log-likelihood, if any, caused by doing so.

The nested logit model defined in (15.36) and (15.37) was proposed by McFadden
(1978), who derived it as a GEV model. An earlier variant of the nested logit
model was similar to (15.36) and (15.37), except that $\exp(\mathbf{x}_{jk}'\boldsymbol{\beta}_j/\rho_j)$ was
replaced by $\exp(\mathbf{x}_{jk}'\boldsymbol{\beta}_j)$. This had an alternative derivation as a natural
extension of the CL model, since CL is the special case of (15.36) and (15.37) with
$\rho_j = 1$. See McFadden (1978, p. 79), Maddala (1983, p. 70), and Greene (2003, p. 726).

It is very important to note that the two variants differ if $\rho_j$ differs across
alternatives; see Koppelman and Wen (1998) and Train (2003, p. 88). Some early studies
obtained sequential estimates that differed substantially from FIML estimates, casting
doubt on the robustness of the nested logit model. However, in some of these studies
the different estimators were being applied to different variants of the nested logit
model. Furthermore, even today different packages estimate different variants of the
model.

The nested logit model can be extended to higher levels of alternatives (or nesting). 
For example, Goldberg (1995) has five levels: (1) buy any car; (2) buy a new car given 
yes to 1; (3) which of nine classes of car was purchased given yes to 2; (4) foreign 
or domestic; (5) model. An added attraction if some nests have numerous choices is 
that it is sufficient to base estimation on a fixed or randomly selected subset of the 
alternatives (see McFadden, 1978). 


15.6.5. Welfare Analysis 


Welfare analysis for the ARUM was presented in Section 15.5.4. In general there is no
explicit solution for $E[CV]$, the expected compensating variation.

Remarkably, for GEV models that are linear in income,
$V(I - p_j, \mathbf{x}_j) = \alpha(I - p_j) + f(\mathbf{x}_j)$, McFadden (1995) and earlier workers show
that there is an explicit solution

$$E[CV] = \frac{1}{\alpha} \left( \ln G(e^{V_1^1}, \ldots, e^{V_m^1}) - \ln G(e^{V_1^0}, \ldots, e^{V_m^0}) \right),$$

where the function $G(\cdot)$ for the GEV distribution is defined in (15.34), and $V_j^0$ and $V_j^1$
are the before and after values of the deterministic component of utility.
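For the nested logit specification of $G(\cdot)$ the explicit solution takes only a few lines to evaluate. The following is a hedged sketch: the function names, the nesting layout (a list of limbs, each a list of branch utilities), and all numerical values are illustrative assumptions, and the formula is read as the after-change log of $G$ minus the before-change log of $G$, divided by the income coefficient $\alpha$.

```python
import math


def nested_G(y, rho):
    """Nested logit G function: sum_j (sum_k Y_jk^(1/rho_j))^rho_j."""
    return sum(
        sum(v ** (1.0 / r) for v in limb) ** r
        for limb, r in zip(y, rho)
    )


def expected_cv(v_before, v_after, rho, alpha):
    """Explicit E[CV] when utility is linear in income with coefficient alpha."""
    g0 = nested_G([[math.exp(v) for v in limb] for limb in v_before], rho)
    g1 = nested_G([[math.exp(v) for v in limb] for limb in v_after], rho)
    return (math.log(g1) - math.log(g0)) / alpha


# Hypothetical deterministic utilities: two limbs with two branches each.
v0 = [[1.0, 0.5], [0.0, 0.2]]
v1 = [[1.0, 1.0], [0.0, 0.2]]   # second branch of limb 1 improves

ecv_nested = expected_cv(v0, v1, rho=[0.5, 0.5], alpha=2.0)
ecv_mnl = expected_cv(v0, v1, rho=[1.0, 1.0], alpha=2.0)   # MNL log-sum case
print(round(ecv_nested, 4), round(ecv_mnl, 4))
```

Setting every scale parameter to one reduces $G$ to a simple sum, so `ecv_mnl` is the usual difference in MNL log-sums divided by $\alpha$.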

For GEV models with income appearing nonlinearly, however, there is no explicit
solution. Then one approach is the simulation method given in Section 15.5.4. For a
multinomial logit model this is simple as it is easy to draw extreme value errors using
the transformation method of Section 12.8.2: draw $u$ from the uniform on (0, 1) and
then set $\varepsilon = -\ln(-\ln(u))$. For a more general nested logit model, however, it is
difficult to randomly draw from a GEV distribution even as simple as the bivariate extreme
value. McFadden (1995) proposed using MCMC with the Metropolis-Hastings algorithm
(see Section 13.5). Herriges and Kling (1999) give an excellent summary of
this simulation method with application to nested logit models for the fishing data of
Section 15.2, using various indirect utility functions including the translog.
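As a quick check on the transformation method for the multinomial logit case, one can draw extreme value errors, pick the utility-maximizing alternative at each draw, and compare the choice frequencies with the closed-form MNL probabilities. The utilities, seed, and number of draws below are arbitrary illustrative choices.

```python
import math
import random


def draw_extreme_value(rng):
    """Transformation method: e = -ln(-ln(u)) with u uniform on (0, 1)."""
    u = rng.random()
    while u == 0.0:  # random() can return exactly 0.0; exclude it
        u = rng.random()
    return -math.log(-math.log(u))


rng = random.Random(2024)
v = [0.5, 0.0, -0.5]        # hypothetical deterministic utilities V_j
n_draws = 40_000

counts = [0] * len(v)
for _ in range(n_draws):
    utilities = [vj + draw_extreme_value(rng) for vj in v]
    counts[utilities.index(max(utilities))] += 1

freq = [c / n_draws for c in counts]
denom = sum(math.exp(vj) for vj in v)
mnl = [math.exp(vj) / denom for vj in v]
print([round(f, 3) for f in freq], [round(p, 3) for p in mnl])
```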

More recently, Dagsvik and Karlström (2004) show that although there is no explicit
solution for $E[CV]$ in the GEV model if income enters nonlinearly, it is analytically
possible to reduce $E[CV]$ to a one-dimensional integral. Computing this integral using
Gaussian quadrature will be much simpler than employing the aforementioned
simulation method.


15.7. Random Parameters Logit 


The random parameters logit model provides a simple way to generalize the MNL
or CL model to permit the utilities of each alternative to be correlated. The model is
perhaps the leading microeconometrics example of a random parameters model for
cross-section data.


15.7.1. Random Parameters Logit Model


The random parameters logit (RPL) model specifies the utility to the ith individual 
for the jth alternative to be 


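$$U_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta}_i + \varepsilon_{ij}, \quad j = 1, 2, \ldots, m, \tag{15.39}$$

where $\varepsilon_{ij}$ are iid extreme value, as for the CL model, but additionally permits the
parameters $\boldsymbol{\beta}_i$ to be random. The most common assumption is that

$$\boldsymbol{\beta}_i \sim N[\boldsymbol{\beta}, \boldsymbol{\Sigma}_\beta]. \tag{15.40}$$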


One variation is to use the log-normal rather than normal distribution for parameters 
whose sign is known a priori. This model is also called a mixed logit model, using 
terminology borrowed from the panel setting for models with random parameters. By 
reexpressing the MNL model as a CL model, the results that follow also cover a ran- 
dom parameters MNL model. 

The model can be rewritten as

$$U_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + v_{ij},$$

$$v_{ij} = \mathbf{x}_{ij}'\mathbf{u}_i + \varepsilon_{ij},$$

where $\mathbf{u}_i \sim N[\mathbf{0}, \boldsymbol{\Sigma}_\beta]$. Then
$\mathrm{Cov}[v_{ij}, v_{ik}] = \mathbf{x}_{ij}'\boldsymbol{\Sigma}_\beta \mathbf{x}_{ik}$, $j \neq k$, so the
introduction of random parameters has the attractive property of inducing correlation across
alternatives.

In most applications the covariance matrix $\boldsymbol{\Sigma}_\beta$ is specified to be diagonal, and
additionally some of the diagonal entries may be set to zero. Then the number of covariance
parameters to estimate equals the number of components of $\boldsymbol{\beta}_i$ that are specified to
be random.
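The induced covariance is a simple quadratic form. The small sketch below evaluates $\mathbf{x}_{ij}'\boldsymbol{\Sigma}_\beta \mathbf{x}_{ik}$ for a diagonal covariance matrix in which only the first coefficient is random; the helper name and all numbers are hypothetical.

```python
def induced_covariance(x_j, x_k, sigma_beta):
    """Quadratic form x_j' Sigma_beta x_k using plain nested lists."""
    return sum(
        x_j[a] * sigma_beta[a][b] * x_k[b]
        for a in range(len(x_j))
        for b in range(len(x_k))
    )


# Diagonal Sigma_beta: first coefficient random (variance 1), second fixed (variance 0).
sigma_beta = [[1.0, 0.0],
              [0.0, 0.0]]
x_ij = [1.0, 2.0]   # hypothetical regressors for alternative j
x_ik = [0.5, 3.0]   # hypothetical regressors for alternative k

cov_jk = induced_covariance(x_ij, x_ik, sigma_beta)
print(cov_jk)
```

Only regressors whose coefficients are random contribute, so the second regressor plays no role in the covariance here.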

As an example, consider a mixed CL model with scalar regressor and parameters
$\beta$ and $\sigma_\beta$. Suppose the parameter estimates are $\hat{\beta} = 2.0$ with standard error 0.5
and $\hat{\sigma}_\beta = 1.0$ with standard error 0.2. Then the null hypothesis of constant parameter,
that is, $\sigma_\beta = 0$, is strongly rejected since $t = 1.0/0.2 = 5.0$. The effect on $\Pr[y_i = j]$ of
an increase in $x_{ij}$ differs across individuals and is positive for about 97.5% of the
sample, since it is estimated that $\beta_i \sim N[2.0, 1.0]$. For an application that emphasizes
interpretation of estimated coefficients, see Revelt and Train (1998).
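The two numbers in this example can be reproduced with the standard normal cdf. The snippet is a minimal illustration; the variable names are arbitrary, and the second argument of $N[2.0, 1.0]$ is treated as a variance, so that the standard deviation is also 1.0.

```python
from statistics import NormalDist

beta_hat = 2.0      # estimated mean of beta_i
sigma_hat = 1.0     # estimated standard deviation of beta_i
se_sigma = 0.2      # standard error of sigma_hat

# t statistic for the null hypothesis sigma_beta = 0.
t_stat = sigma_hat / se_sigma

# Share of individuals with a positive coefficient: Pr[beta_i > 0].
share_positive = 1.0 - NormalDist(mu=beta_hat, sigma=sigma_hat).cdf(0.0)
print(t_stat, round(share_positive, 4))
```

The exact normal probability is slightly above the rounded figure quoted in the text, which corresponds to the usual two-standard-deviation rule of thumb.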

The industrial organization literature considers aggregation over consumers of 
models similar to the RPL model to estimate demand parameters using market- 
level data. See, for example, Berry (1994) and Nevo (2001), and also Allenby and 
Rossi (1991). 


15.7.2. Estimation of Random Parameters Logit 


In the linear regression model with random parameters, OLS estimation yields
estimates of the means $\boldsymbol{\beta}$ that are consistent though inefficient. In a nonlinear model,
however, estimators that fail to control for the randomness of the parameters will be
inconsistent. Thus the usual conditional logit MLE will be inconsistent if the dgp is
given by (15.39) and (15.40). Instead, ML estimation must explicitly account for the
stochastic process for $\boldsymbol{\beta}_i$.

If $\boldsymbol{\beta}_i$ were known, so that the only source of randomness is $\varepsilon_{ij}$, a CL model is
obtained with probability $p_{ij} = e^{\mathbf{x}_{ij}'\boldsymbol{\beta}_i} / \sum_{l=1}^{m} e^{\mathbf{x}_{il}'\boldsymbol{\beta}_i}$.
Since $\boldsymbol{\beta}_i$ is in fact random we need to integrate out this randomness. This yields

$$p_{ij} = \Pr[y_i = j] = \int \frac{e^{\mathbf{x}_{ij}'\boldsymbol{\beta}_i}}{\sum_{l=1}^{m} e^{\mathbf{x}_{il}'\boldsymbol{\beta}_i}}\, \phi(\boldsymbol{\beta}_i | \boldsymbol{\beta}, \boldsymbol{\Sigma}_\beta)\, d\boldsymbol{\beta}_i, \tag{15.41}$$

where the integral is multidimensional and $\phi(\boldsymbol{\beta}_i | \boldsymbol{\beta}, \boldsymbol{\Sigma}_\beta)$ denotes the
multivariate normal density for $\boldsymbol{\beta}_i$ with mean $\boldsymbol{\beta}$ and variance $\boldsymbol{\Sigma}_\beta$.

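The MLE maximizes $\ln L_N = \sum_{i=1}^{N} \sum_{j=1}^{m} y_{ij} \ln p_{ij}$ with respect to
$\boldsymbol{\beta}$ and $\boldsymbol{\Sigma}_\beta$. The challenge is that there is no closed-form solution for the
integral, whose dimension is given by the number of components of $\boldsymbol{\beta}_i$ that are random,
with nonzero variance. Estimation is therefore by simulation methods.

One approach is to approximate $p_{ij}$ using the direct simulator (see Section 12.4.1).
This replaces the integral (15.41) by the average of $S$ evaluations of the integrand
at random draws of $\boldsymbol{\beta}_i$ from the $N[\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{\Sigma}}_\beta]$ distribution. The MSL
estimator then maximizes

$$\ln \hat{L}_N(\boldsymbol{\beta}, \boldsymbol{\Sigma}_\beta) = \sum_{i=1}^{N} \sum_{j=1}^{m} y_{ij} \ln\left[ \frac{1}{S} \sum_{s=1}^{S} \frac{e^{\mathbf{x}_{ij}'\boldsymbol{\beta}_i^{(s)}}}{\sum_{l=1}^{m} e^{\mathbf{x}_{il}'\boldsymbol{\beta}_i^{(s)}}} \right], \tag{15.42}$$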


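where $\boldsymbol{\beta}_i^{(s)}$, $s = 1, \ldots, S$, are random draws from the density
$\phi(\boldsymbol{\beta}_i; \boldsymbol{\beta}, \boldsymbol{\Sigma}_\beta)$. Since $\boldsymbol{\beta}$ and $\boldsymbol{\Sigma}_\beta$ are unknown,
this summation is embedded in an iterative procedure with evaluation at
$\hat{\boldsymbol{\beta}}^{(r)}$ and $\hat{\boldsymbol{\Sigma}}_\beta^{(r)}$ at the $r$th round. Consistency requires that
$S \to \infty$ as well as $N \to \infty$, and asymptotic equivalence to the MLE additionally requires
that $\sqrt{N}/S \to 0$ (see Section 12.4.3). Methods for speeding up computation include use of
Halton sequences (see Section 12.7.4) and alternative simulators.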
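The direct simulator can be illustrated for a single individual with one scalar regressor per alternative. This is a simplified sketch under stated assumptions: a scalar random coefficient with given mean and standard deviation, pseudo-random rather than Halton draws, and hypothetical function names and data. In estimation the same averaging would sit inside the log-likelihood and be repeated at every iteration.

```python
import math
import random


def cl_probs(x, beta):
    """Conditional logit probabilities for scalar regressor values x_j."""
    expv = [math.exp(xj * beta) for xj in x]
    denom = sum(expv)
    return [e / denom for e in expv]


def rpl_probs_simulated(x, beta_mean, beta_sd, n_draws, rng):
    """Average the CL probabilities over n_draws normal draws of beta_i."""
    totals = [0.0] * len(x)
    for _ in range(n_draws):
        beta_s = rng.gauss(beta_mean, beta_sd)   # one draw of beta_i
        for j, p in enumerate(cl_probs(x, beta_s)):
            totals[j] += p
    return [t / n_draws for t in totals]


rng = random.Random(7)
x_i = [0.0, 0.5, 1.0]   # hypothetical regressor values for m = 3 alternatives

p_sim = rpl_probs_simulated(x_i, beta_mean=2.0, beta_sd=1.0, n_draws=2_000, rng=rng)
p_fixed = rpl_probs_simulated(x_i, beta_mean=2.0, beta_sd=0.0, n_draws=10, rng=rng)
print([round(p, 3) for p in p_sim], [round(p, 3) for p in p_fixed])
```

With a zero standard deviation every draw equals the mean, so `p_fixed` reproduces the ordinary conditional logit probabilities.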

An alternative estimator uses Bayesian methods with relatively flat priors. Train
(2001, 2003) specifies hierarchical priors with $\boldsymbol{\beta} \sim N[\boldsymbol{\beta}^*, \boldsymbol{\Omega}^*]$, where
$\boldsymbol{\Omega}^*$ is assumed to be large, and with $\boldsymbol{\Sigma}_\beta$ assumed to be inverse-Wishart
distributed with degrees of freedom $K = \dim[\boldsymbol{\beta}]$ and scale parameter $\mathbf{I}_K$. Rather than
working with the posterior for just $\boldsymbol{\beta}$ and $\boldsymbol{\Sigma}_\beta$ it is computationally quicker
to additionally include $\boldsymbol{\beta}_i$, $i = 1, \ldots, N$. Then (1) the conditional posterior for
$\boldsymbol{\beta} | \boldsymbol{\Sigma}_\beta, \boldsymbol{\beta}_i$ is normal, (2) the conditional posterior for
$\boldsymbol{\Sigma}_\beta | \boldsymbol{\beta}, \boldsymbol{\beta}_i$ is inverse Wishart, and (3) the conditional posterior for
$\boldsymbol{\beta}_i | \boldsymbol{\beta}, \boldsymbol{\Sigma}_\beta$ is proportional to the integrand in (15.41). Given these
conditional posteriors estimation can be done using a variation of the Gibbs sampler (see
Section 13.5.2), with the complication that draws for the third posterior need to use one
iteration of the Metropolis-Hastings algorithm (see Section 13.5.4) because the full set
of conditionals is not available. In an application this took similar computation time to
the MSL estimator and, given the relatively flat prior, yielded parameter estimates and
standard errors that were generally within 10% of those from MSL estimation.


15.7.3. Generalized Random Utility Models


Models more flexible than multinomial logit are desirable. In this regard there is cur- 
rently great enthusiasm regarding the random parameters logit model. McFadden and 
Train (2000) show that any random utility model can be approximated arbitrarily well 
by a mixed model, though this result requires appropriate choice of regressors and 
mixing distribution. 

There is no reason to restrict the random parameters approach to multinomial logit 
models. For example, it may be extended to nested logit models. Moreover, additional 
sources of randomness may be incorporated, notably latent classes and latent variables. 

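To present these extensions we begin with the ARUM (15.22). This specifies the
utility to individual $i$ of the $j$th alternative to be $U_{ij} = V_{ij}(\mathbf{x}_i, \boldsymbol{\beta}) + \varepsilon_{ij}$,
where $\mathbf{x}_i$ denotes observed data, $\boldsymbol{\beta}$ denotes unknown parameters, and $\varepsilon_{ij}$
denotes an error independent over $i$ but possibly correlated over $j$. Assume that the
distribution of $\varepsilon_{ij}$ is such that (15.23) yields a closed-form solution for the choice
probabilities denoted

$$p_{ij} = F_j(\mathbf{V}_i(\mathbf{x}_i, \boldsymbol{\beta}), \boldsymbol{\theta}_\varepsilon),$$

where $\mathbf{V}_i(\mathbf{x}_i, \boldsymbol{\beta}) = [V_{i1}(\mathbf{x}_i, \boldsymbol{\beta}), \ldots, V_{im}(\mathbf{x}_i, \boldsymbol{\beta})]$ and
$\boldsymbol{\theta}_\varepsilon$ denotes any unknown parameters of the distribution of
$\boldsymbol{\varepsilon}_i = (\varepsilon_{i1}, \ldots, \varepsilon_{im})$. Such a closed-form solution is possible
if $\boldsymbol{\varepsilon}_i$ has a GEV distribution with special cases leading to multinomial logit and
nested logit models.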

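A more general model introduces additional randomness into this model. First, the
previously deterministic part of utility becomes $V_{ij} = V_{ij}(\mathbf{x}_i, \boldsymbol{\xi}_i, \boldsymbol{\beta})$. Then
assuming that $\boldsymbol{\varepsilon}_i$ is such that a closed-form solution for the probabilities exists
conditional on $\boldsymbol{\xi}_i$, unconditionally

$$p_{ij} = \int F_j(\mathbf{V}_i(\mathbf{x}_i, \boldsymbol{\xi}_i, \boldsymbol{\beta}), \boldsymbol{\theta}_\varepsilon)\, f(\boldsymbol{\xi}_i | \boldsymbol{\theta}_\xi)\, d\boldsymbol{\xi}_i, \tag{15.43}$$

where $f(\boldsymbol{\xi} | \boldsymbol{\theta}_\xi)$ denotes the density of $\boldsymbol{\xi}$. The RPL model is an example with
$V_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + \mathbf{x}_{ij}'\boldsymbol{\xi}_i$, where $\boldsymbol{\xi}_i$ is $N[\mathbf{0}, \boldsymbol{\Sigma}]$ and is
motivated via a random parameters argument. However, $\boldsymbol{\xi}_i$ can also be introduced as an
additional disturbance term or as a relevant latent variable. Second, individuals may be
assumed to come from one of $C$ latent classes; see Section 18.5 for a duration model example
and Swait (2003) for a GEV example of latent class or finite mixtures models. If
$\boldsymbol{\beta}$ and $\boldsymbol{\theta}_\varepsilon$ vary by class then (15.43) becomes unconditionally

$$p_{ij} = \sum_{c=1}^{C} \left[ \int F_j(\mathbf{V}_i(\mathbf{x}_i, \boldsymbol{\xi}_i, \boldsymbol{\beta}^c), \boldsymbol{\theta}_\varepsilon^c)\, f(\boldsymbol{\xi}_i | \boldsymbol{\theta}_\xi)\, d\boldsymbol{\xi}_i \right] \pi_c, \tag{15.44}$$

where $\pi_c$ denotes probability of membership in the $c$th class and typically $C = 2$ or
$C = 3$. The MSL estimator then maximizes

$$\ln \hat{L}_N = \sum_{i=1}^{N} \sum_{j=1}^{m} y_{ij} \ln\left[ \frac{1}{S} \sum_{s=1}^{S} \sum_{c=1}^{C} F_j(\mathbf{V}_i(\mathbf{x}_i, \boldsymbol{\xi}_i^{(s)}, \boldsymbol{\beta}^c), \boldsymbol{\theta}_\varepsilon^c) \times \pi_c \right],$$

where $\boldsymbol{\xi}_i^{(s)}$ denotes the $s$th draw from $f(\boldsymbol{\xi}_i | \boldsymbol{\theta}_\xi)$. Kamakura and Wedel
(2004) estimate a finite mixtures MNL model using Bayesian methods.

Figure 15.1: Generalized random utility model.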


Walker and Ben-Akiva (2002) call such a model a generalized random utility 
model. They cite many articles with such extensions, consider the use of stated pref- 
erence data to supplement revealed preference data, and provide a substantial empir- 
ical illustration. Figure 15.1, derived from Walker and Ben-Akiva (2002), summarizes 
the various extensions. 

The multinomial modeling literature has been at the forefront of developing and 
estimating highly structured parametric models that incorporate random parameters, 
latent variables, and latent parameters and combine data from more than one source. 
These methods are applicable to any type of cross-section data, not just discrete 
outcomes. 


15.8. Multinomial Probit 


An alternative and obvious way to introduce correlation across choices in the unob- 
served component is to work with normally distributed errors. However, ML esti- 
mation is difficult as in the most general case an (m — 1)-fold integral needs to be 
calculated. 


15.8.1. Multinomial Probit Model 


The multinomial probit (MNP) model is an m-choice multinomial model, with utility 
of the jth choice given by 


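$$U_j = V_j + \varepsilon_j, \quad j = 1, 2, \ldots, m, \tag{15.45}$$

where the errors are joint normally distributed, with

$$\boldsymbol{\varepsilon} \sim N[\mathbf{0}, \boldsymbol{\Sigma}], \tag{15.46}$$

where the $m \times 1$ vector $\boldsymbol{\varepsilon} = [\varepsilon_1 \ldots \varepsilon_m]'$. Usually
$V_j = \mathbf{x}_j'\boldsymbol{\beta}$ or $V_j = \mathbf{x}'\boldsymbol{\beta}_j$.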

Different MNP models arise from different specifications of the covariance matrix
$\boldsymbol{\Sigma}$. Some of the off-diagonal entries are specified to be nonzero, to permit correlation
across the errors, though some restrictions need to be placed on $\boldsymbol{\Sigma}$. Note that if the
errors are uncorrelated the MNP still yields no closed-form solution for the probabilities
and it is easier to assume instead that the errors are extreme value and use the CL
or MNL models.

Restrictions on $\boldsymbol{\Sigma}$ are needed to ensure identification. It is clear from (15.23) that,
for any ARUM, choice is determined by the differences in utility or errors. Thus we
consider the difference $U_j - U_1$ between utility of alternative $j$ and that of alternative
1, chosen to be the benchmark alternative. Bunch (1991) demonstrated that all but
one of the parameters of the covariance matrix of the errors $\varepsilon_j - \varepsilon_1$ are identified;
see the discussion at the end of Section 15.5.1. One way to achieve this identification
is to normalize $\varepsilon_1 = 0$, say, and then restrict one covariance element. For example,
if $m = 2$, set $\varepsilon_1 = 0$ so $\sigma_{11} = 0$ and $\sigma_{12} = 0$ and additionally restrict $\sigma_{22} = 1$. Then
$\varepsilon_2 - \varepsilon_1 = \varepsilon_2 \sim N[0, 1]$, which is the binary probit model.

Additional restrictions on $\boldsymbol{\Sigma}$ or $\boldsymbol{\beta}$ may be needed for successful application. Keane
(1992) demonstrated that even if assumptions on the error covariance are made to 
ensure just-identification, in practice the parameters of the MNP model may be highly 
imprecisely estimated in models with regressors that do not vary with the alternative. 
Further restrictions on the MNP model are then needed. This estimation imprecision is 
qualitatively similar to high multicollinearity among regressors in a linear regression. 
Keane found that exclusion restrictions on the regressors (with one exclusion for each 
utility index) work well. Alternatively, and more commonly, further restrictions may 
be placed on the covariance parameters. 

A popular parsimonious model for the errors is the factor model

$$\varepsilon_j = v_j + \sum_{l=1}^{L} c_{jl}\xi_l, \quad j = 1, 2, \ldots, m,$$

where $v_j$ and $\xi_1, \ldots, \xi_L$ are iid standard normal and $c_{jl}$ are weights called factor
loadings to be estimated. This model can greatly reduce the number of covariance parameters,
from $m(m + 1)/2$ to $L$, and requires an $(L + 1)$-dimensional integral. Numerical
methods, usually Gaussian quadrature, can be used for low values of $L$, whereas
simulation methods need to be used for larger $L$. For panel data the random effects model
(see Section 21.2.1) can be viewed as a factor model with error $u_{it} = \alpha_i + \varepsilon_{it}$, and the
factor model may be especially appropriate in a panel probit setting.
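Because $v_j$ and the factors are iid standard normal, the factor model implies $\mathrm{Cov}[\varepsilon_j, \varepsilon_k] = \mathbf{1}[j = k] + \sum_{l} c_{jl} c_{kl}$. The sketch below builds this implied covariance matrix from a matrix of loadings; the function name and the one-factor loadings are hypothetical.

```python
def factor_covariance(loadings):
    """Implied covariance of e_j = v_j + sum_l c_jl * xi_l.

    loadings[j][l] is the loading c_jl; v_j and xi_l are iid standard normal,
    so the covariance matrix is the identity plus C C'.
    """
    m = len(loadings)
    return [
        [
            (1.0 if j == k else 0.0)
            + sum(cj * ck for cj, ck in zip(loadings[j], loadings[k]))
            for k in range(m)
        ]
        for j in range(m)
    ]


# Hypothetical loadings for m = 3 alternatives and L = 1 factor.
C = [[0.5], [1.0], [-0.5]]
Sigma = factor_covariance(C)
for row in Sigma:
    print(row)
```

Loadings of opposite sign generate negative correlation between the corresponding errors, here between alternatives 2 and 3.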


15.8.2. Estimation of Multinomial Probit 


The regression and error variance parameters are preferably estimated by ML with
log-likelihood given in Section 15.3.2. The challenge is that there is no closed-form
expression for the choice probabilities.

For a three-choice MNP model


$$p_1 = \Pr[y = 1] = \int_{-\infty}^{-\tilde{V}_{31}} \int_{-\infty}^{-\tilde{V}_{21}} f(\tilde{\varepsilon}_{21}, \tilde{\varepsilon}_{31})\, d\tilde{\varepsilon}_{21}\, d\tilde{\varepsilon}_{31}$$

(see (15.24)), where $f(\tilde{\varepsilon}_{21}, \tilde{\varepsilon}_{31})$ is a bivariate normal with as many as two
free covariance parameters and $\tilde{V}_{21}$ and $\tilde{V}_{31}$ depend on regressors and parameters
$\boldsymbol{\beta}$. This bivariate normal integral can be quickly evaluated numerically. More generally,
however, an $m$-choice model requires numerical evaluation of an $(m - 1)$-variate integral.
A trivariate normal integral is the limit for numerical methods, limiting standard
numerical integration methods to a four-choice MNP model.

For larger models an alternative is to use simulation methods. For simplicity we
refer to the three-choice MNP model. One possibility is to use the frequency simulator
that approximates $p_1$ by the fraction of draws of $(\tilde{\varepsilon}_{21}, \tilde{\varepsilon}_{31})$ that are less than
$(-\tilde{V}_{21}, -\tilde{V}_{31})$. From Section 12.7.1 this simulator is not smooth and it can be very
inefficient (see Section 12.7.2). Furthermore, in the current setting it is possible that it
yields boundary values of $p_1 = 0$ or 1. In general it is better to use importance sampling,
detailed in Section 12.7.2. For Monte Carlo integration over a region of the
multivariate normal a very popular importance sampler is the GHK simulator, due to
Geweke (1992), Hajivassiliou and McFadden (1994), and Keane (1994). This recursively
truncates the multivariate normal pdf. Compared to the frequency simulator it
is smooth, requires many fewer draws for alternatives with low probability of being
chosen, and is unlikely to have boundary problems. Train (2003) provides a detailed
account of this method.
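The contrast between the two simulators can be seen in a bivariate sketch of the three-choice case. The code below is illustrative only and rests on labeled assumptions: the differenced errors are written as $\mathbf{e} = \mathbf{L}\boldsymbol{\eta}$ with $\mathbf{L}$ a lower-triangular Cholesky factor of their covariance and $\boldsymbol{\eta}$ standard normal, the upper limits `a` stand for the two thresholds that the differenced errors must fall below, and the function names, covariance values, and limits are all hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def frequency_simulator(a, chol, n_draws, rng):
    """Share of draws with e1 < a1 and e2 < a2; a step function of a."""
    (l11, _), (l21, l22) = chol
    hits = 0
    for _ in range(n_draws):
        eta1, eta2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        if l11 * eta1 < a[0] and l21 * eta1 + l22 * eta2 < a[1]:
            hits += 1
    return hits / n_draws


def ghk_simulator(a, chol, n_draws, rng):
    """GHK sketch: draw eta1 from its truncated normal, weight by Pr[e2 < a2 | eta1]."""
    (l11, _), (l21, l22) = chol
    p1 = STD_NORMAL.cdf(a[0] / l11)                 # Pr[e1 < a1]
    total = 0.0
    for _ in range(n_draws):
        u = rng.random()
        while u == 0.0:                             # inv_cdf needs 0 < p < 1
            u = rng.random()
        eta1 = STD_NORMAL.inv_cdf(u * p1)           # truncated so l11 * eta1 < a1
        p2 = STD_NORMAL.cdf((a[1] - l21 * eta1) / l22)
        total += p1 * p2
    return total / n_draws


# Hypothetical covariance [[1, 0.5], [0.5, 1]] and its Cholesky factor.
chol = ((1.0, 0.0), (0.5, math.sqrt(0.75)))
limits = (0.3, -0.2)

p_freq = frequency_simulator(limits, chol, 20_000, random.Random(11))
p_ghk = ghk_simulator(limits, chol, 5_000, random.Random(11))
print(round(p_freq, 3), round(p_ghk, 3))
```

Each GHK draw contributes a product of normal probabilities strictly between zero and one, which is why the simulated probability varies smoothly with the parameters and avoids the boundary values that the frequency simulator can produce.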

The preceding discussion considers evaluation of MNP probabilities assuming
knowledge of $\boldsymbol{\beta}$ and $\boldsymbol{\Sigma}$. In fact we need to estimate $\boldsymbol{\beta}$ and $\boldsymbol{\Sigma}$. The
maximum simulated likelihood estimator maximizes

$$\ln \hat{L}_N(\boldsymbol{\beta}, \boldsymbol{\Sigma}) = \sum_{i=1}^{N} \sum_{j=1}^{m} y_{ij} \ln \hat{p}_{ij},$$

where the $\hat{p}_{ij}$ are obtained using the GHK or other simulator. Consistency requires the
number of draws in the simulator $S \to \infty$ as well as $N \to \infty$. The method is very
burdensome. At the $r$th round of an iterative procedure (see Chapter 10) the estimates
are $\hat{\boldsymbol{\beta}}^{(r)}$ and $\hat{\boldsymbol{\Sigma}}^{(r)}$ and the update requires recalculating $\hat{p}_{ij}$, which
requires $S$ draws for each of $N$ individuals.

An alternative estimation procedure is the method of simulated moments
(see Section 12.5). From (15.8) a consistent method of moments estimator solves
$\sum_{i=1}^{N} \sum_{j=1}^{m} (y_{ij} - p_{ij})\mathbf{z}_i = \mathbf{0}$, where, for example, $\mathbf{z}_i = \mathbf{x}_i$.
The corresponding MSM estimator of $\boldsymbol{\beta}$ and $\boldsymbol{\Sigma}$ solves the estimating equations

$$\sum_{i=1}^{N} \sum_{j=1}^{m} (y_{ij} - \hat{p}_{ij})\mathbf{z}_i = \mathbf{0},$$

where the $\hat{p}_{ij}$ are obtained using an unbiased simulator. Then $(y_{ij} - \hat{p}_{ij})\mathbf{z}_i$ is
unbiased for $(y_{ij} - p_{ij})\mathbf{z}_i$, so consistent estimation is possible even if $S = 1$. This can
greatly reduce computation. However, there is an efficiency loss for low $S$, and even for large
$S$ MSM is less efficient than MSL since in this example the method of moments is less
efficient than ML. A less-used related method that is as efficient as MSL is the method
of simulated scores (see Hajivassiliou and McFadden, 1998).

An alternative estimator uses Bayesian methods. Unlike RPL there is no closed-form
solution for the probabilities, which need to be derived from the utilities. The
latent utilities $\mathbf{U}_i = (U_{i1}, \ldots, U_{im})$ are introduced as auxiliary variables and the data
augmentation approach (see Section 13.7) is used. Letting $\mathbf{U} = (\mathbf{U}_1, \ldots, \mathbf{U}_N)$ and
$\mathbf{y} = (y_1, \ldots, y_N)$ we have that the Gibbs sampler cycles among (1) the conditional
posterior for $\boldsymbol{\beta} | \mathbf{y}, \mathbf{U}, \boldsymbol{\Sigma}$, (2) the conditional posterior for
$\boldsymbol{\Sigma} | \mathbf{y}, \boldsymbol{\beta}, \mathbf{U}$, and (3) the posterior for $\mathbf{U}_i | \mathbf{y}, \boldsymbol{\beta}, \boldsymbol{\Sigma}$. Albert and
Chib (1993) provide a quite general treatment for both unordered and ordered multinomial
models. McCulloch and Rossi (1994) provide a substantive MNP application. Chib (2001)
discusses the complication of imposing the restrictions on $\boldsymbol{\Sigma}$ needed for identification
(see Section 15.8.1).


15.8.3. Discussion 


Both MNP and RPL models lack a closed-form solution for $p_{ij}$. However, for RPL
there is at least a closed-form solution conditional on $\boldsymbol{\beta}_i$ and the only problem is to
integrate out $\boldsymbol{\beta}_i$. For the MNP model, which predates the RPL model, there is no such
conditional result and approximating $p_{ij}$ becomes more challenging, especially if $p_{ij}$
is close to zero or one. It appears to be easier to get model flexibility through nested
logit, RPL or mixture models than by use of MNP.


15.9. Ordered, Sequential, and Ranked Outcomes 


In this section we present models with more structure than unordered models, such as 
those with a natural ordering of alternatives or sequencing of decisions. Analysis is 
straightforward as appropriate models are well established and estimation is again by 
MLE based on (15.4), with different models leading to different specifications of the 
probabilities $p_{ij}$.


15.9.1. Ordered Multinomial Models 


Suppose that there is a natural ordering of alternatives. For example, self-rated health
status may be one of excellent, good, fair, or poor. Such data can be modeled by
an unordered multinomial model, but a much more parsimonious and sensible
model is one that takes account of this ordering.

The starting point is an index model, with single latent variable 


$$y_i^* = x_i'\beta + u_i, \tag{15.47}$$


where $x$ here does not include an intercept, a departure from Section 14.4.1. As $y^*$
crosses a series of increasing unknown thresholds we move up the ordering of alternatives.
For example, for very low $y^*$ health status is poor, for $y^* > \alpha_1$ health status
improves to fair, for $y^* > \alpha_2$ it improves further to good, and so on.
In general for an $m$-alternative ordered model we define
$$y_i = j \quad \text{if } \alpha_{j-1} < y_i^* \le \alpha_j, \tag{15.48}$$
where $\alpha_0 = -\infty$ and $\alpha_m = \infty$. Then


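$$\begin{aligned}
\Pr[y_i = j] &= \Pr[\alpha_{j-1} < y_i^* \le \alpha_j] \\
&= \Pr[\alpha_{j-1} < x_i'\beta + u_i \le \alpha_j] \\
&= \Pr[\alpha_{j-1} - x_i'\beta < u_i \le \alpha_j - x_i'\beta] \\
&= F(\alpha_j - x_i'\beta) - F(\alpha_{j-1} - x_i'\beta),
\end{aligned} \tag{15.49}$$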


where $F$ is the cdf of $u_i$. The regression parameters $\beta$ and the $(m - 1)$ threshold
parameters $\alpha_1, \ldots, \alpha_{m-1}$ are obtained by maximizing the log-likelihood (15.5) with
$p_{ij}$ defined in (15.49). For the ordered logit model $u$ is logistic distributed with
$F(z) = e^z/(1 + e^z)$. For the ordered probit model $u$ is standard normal distributed
and $F(\cdot)$ is the standard normal cdf. Letting $K$ denote the number of regressors excluding
the intercept, an $m$-choice ordered model has $K + m - 1$ parameters whereas
an MNL model has $(m - 1)(K + 1)$ parameters.
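
The probabilities in (15.49) are simple to compute once the index and thresholds are given. The following is a minimal Python sketch, not an estimation routine; the function names `ordered_probs`, `logistic_cdf`, and `normal_cdf` and the numerical values are illustrative assumptions.

```python
import math

def logistic_cdf(z):
    """Logistic cdf F(z) = e^z / (1 + e^z), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def normal_cdf(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ordered_probs(xb, thresholds, cdf):
    """Return [Pr(y=1), ..., Pr(y=m)] for the index xb = x'beta.

    thresholds holds the m-1 increasing values alpha_1 < ... < alpha_{m-1};
    alpha_0 = -inf and alpha_m = +inf are appended here.
    """
    cuts = [-math.inf] + list(thresholds) + [math.inf]
    F = [cdf(c - xb) for c in cuts]
    # Pr(y = j) = F(alpha_j - x'beta) - F(alpha_{j-1} - x'beta)
    return [F[j] - F[j - 1] for j in range(1, len(cuts))]

# Four ordered alternatives (e.g. poor, fair, good, excellent), hypothetical values.
probs = ordered_probs(xb=0.4, thresholds=[-1.0, 0.0, 1.5], cdf=logistic_cdf)
```

Passing `normal_cdf` instead of `logistic_cdf` gives the ordered probit probabilities; in either case the $m$ probabilities sum to one by construction.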

The sign of the regression parameters $\beta$ can be immediately interpreted as determining
whether or not the latent variable $y^*$ increases with the regressor. The marginal
effects on the probabilities are

$$\frac{\partial \Pr[y_i = j]}{\partial x_i} = \{F'(\alpha_{j-1} - x_i'\beta) - F'(\alpha_j - x_i'\beta)\}\beta,$$

where $F'$ denotes the derivative of $F$. The term in braces can be positive or negative.

This model can also be applied to count data that take just a few values. Cameron 
and Trivedi (1986) applied the ordered probit model to number of doctor consultations. 
Hausman, Lo, and MacKinlay (1992) applied the ordered probit to data on changes
in a count, which can be negative, and additionally modeled the error term $u_i$ to be
heteroskedastic.


15.9.2. Sequential Multinomial Models 


In some situations decisions are made sequentially. For example, one might first decide
whether or not to go to college. If no college is chosen then $y = 1$. If $y \neq 1$
then decide whether to go to a two-year college ($y = 2$) or four-year college ($y = 3$).
Given specification of this sequence the probabilities are easily obtained. For example,
model the first decision by a probit model and the second decision, if relevant,
by a probit model. Then $\Pr[y = 1] = \Phi(x_1'\beta_1)$ and $\Pr[y = 2 | y \neq 1] = \Phi(x_2'\beta_2)$. The
unconditional probability is

$$\Pr[y = 2] = \Pr[y = 2 | y \neq 1] \times \Pr[y \neq 1] = \Phi(x_2'\beta_2)(1 - \Phi(x_1'\beta_1)).$$

The parameters $\beta_1$ and $\beta_2$ can be estimated by maximizing the log-likelihood function
(15.5), where $p_{i1} = \Phi(x_{1i}'\beta_1)$, $p_{i2}$ is given in the preceding equation, and
$p_{i3} = 1 - p_{i1} - p_{i2}$.
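
The three probabilities of this sequential probit can be sketched in a few lines of Python. The names `sequential_probs` and `loglik_contribution` are illustrative assumptions, and the indices $x_1'\beta_1$ and $x_2'\beta_2$ are passed in directly rather than estimated.

```python
import math

def normal_cdf(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def sequential_probs(xb1, xb2):
    """Return (Pr[y=1], Pr[y=2], Pr[y=3]) for indices xb1 = x1'beta1, xb2 = x2'beta2."""
    p1 = normal_cdf(xb1)                  # no college
    p2 = normal_cdf(xb2) * (1.0 - p1)     # Pr[y=2 | y!=1] * Pr[y!=1]
    p3 = 1.0 - p1 - p2                    # four-year college
    return p1, p2, p3

def loglik_contribution(y, xb1, xb2):
    """Log-likelihood contribution ln p_{iy} of one observation with outcome y in {1, 2, 3}."""
    return math.log(sequential_probs(xb1, xb2)[y - 1])

p = sequential_probs(0.0, 0.0)
```

Summing `loglik_contribution` over observations gives the objective that a numerical optimizer would maximize over $\beta_1$ and $\beta_2$.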
This approach relies on correct specification of the sequence of decision making. A
better model for this choice example may be a three-choice nested logit model where 
the errors in the utilities for two-year college and four-year college are correlated with 
each other and independent of the error in the utility for no college. These models can 
be compared using the likelihood-based methods given in Section 8.5. 


15.9.3. Ranked Data Models 


The models discussed thus far have assumed that alternatives are mutually exclusive
and only one alternative is chosen. More generally, alternatives may be ranked, espe- 
cially with stated preference data. For example, both first and second choices may be 
known. 

The rank-ordered logit model is simple to estimate (see Beggs, Cardell, and 
Hausman, 1981). Consider a four-alternative conditional logit model with alternative 
2 the first choice and alternative 3 the second choice. Alternative 2 is chosen from all 
four alternatives and then alternative 3 is chosen from the remaining alternatives 1, 3, 
and 4. The joint probability of these first and second choices is 


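$$\frac{e^{x_{i2}'\beta}}{e^{x_{i1}'\beta} + e^{x_{i2}'\beta} + e^{x_{i3}'\beta} + e^{x_{i4}'\beta}} \times \frac{e^{x_{i3}'\beta}}{e^{x_{i1}'\beta} + e^{x_{i3}'\beta} + e^{x_{i4}'\beta}}.$$
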
Estimation is by ML given similar expressions for the other 11 joint probabilities. 
For the multinomial probit model there is no similar simplification. Hajivassiliou 
and Ruud (1994) present a method to simulate the joint probabilities; they use the 
rank-ordered probit model to illustrate a variety of simulation-based estimators. 
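
The sequential structure of the rank-ordered logit probability, a product of conditional logit probabilities over shrinking choice sets, can be sketched as follows. This is a minimal Python illustration; the function name `rank_ordered_logit_prob` and the utility indices in `v` are hypothetical.

```python
import math
from itertools import permutations

def rank_ordered_logit_prob(v, ranking):
    """Probability of a (possibly partial) ranking, best alternative first.

    v[j] is the index x_ij'beta of alternative j (0-based);
    ranking is a tuple of distinct alternative indices.
    """
    remaining = list(range(len(v)))
    prob = 1.0
    for choice in ranking:
        denom = sum(math.exp(v[j]) for j in remaining)
        prob *= math.exp(v[choice]) / denom   # logit over the alternatives still available
        remaining.remove(choice)
    return prob

v = [0.2, 1.0, 0.5, -0.3]                      # hypothetical indices for four alternatives
p_23 = rank_ordered_logit_prob(v, (1, 2))      # alternative 2 first, alternative 3 second
pairs = list(permutations(range(4), 2))        # all first/second choice combinations
total = sum(rank_ordered_logit_prob(v, r) for r in pairs)
```

With four alternatives there are twelve possible first and second choice pairs, and their probabilities sum to one.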


15.10. Multivariate Discrete Outcomes 


The preceding models, aside from rank-ordered models, are models for one discrete 
dependent variable that takes one of m mutually exclusive values. Now we consider 
models when there is more than one discrete outcome. The log-likelihood function 
is similar to (15.5) for the multinomial model, with different models corresponding 
to different functional forms for the probabilities. These probabilities may need to 
account for correlation of the two outcomes and possibly simultaneity. 


15.10.1. Bivariate Discrete Outcomes 


For simplicity consider bivariate discrete data $(y_{1i}, y_{2i})$. For example, in a joint
model of labor supply and fertility the dependent variables $(y_{1i}, y_{2i})$ for individual
$i$ may be $y_{1i} = 2$ if work and $y_{1i} = 1$ if do not work, and $y_{2i} = 2$ if have children and
$y_{2i} = 1$ if have no children.

More generally, $y_1$ may take values $1, \ldots, m_1$ and $y_2$ may take values $1, \ldots, m_2$.
For individual $i$ define

$$p_{ijk} = \Pr[y_{1i} = j, y_{2i} = k], \quad j = 1, \ldots, m_1, \quad k = 1, \ldots, m_2. \tag{15.50}$$

Note that $p_{ijk}$ define probabilities of mutually exclusive events and $\sum_j \sum_k p_{ijk} = 1$.
Define $m_1 \times m_2$ corresponding binary indicator variables $y_{ijk} = 1$ if $(y_{1i} = j, y_{2i} = k)$
and $y_{ijk} = 0$ otherwise. Then the joint density for the $i$th observation is

$$f(y_{1i}, y_{2i}) = \prod_{j=1}^{m_1} \prod_{k=1}^{m_2} p_{ijk}^{y_{ijk}}.$$

The log-likelihood is then $\sum_{i=1}^{N} \sum_{j=1}^{m_1} \sum_{k=1}^{m_2} y_{ijk} \ln p_{ijk}$ and estimation is by ML as in
Section 15.4.2.

The essential difference between the multivariate and multinomial models is in the 
specification of the functional form for the probabilities. 

In the simplest case the two discrete dependent variables are independent and
$p_{ijk} = \Pr[y_{1i} = j] \times \Pr[y_{2i} = k]$. Then $y_1$ and $y_2$ can be modeled using separate multinomial
models.

If the two variables are instead viewed as interrelated, a simple approach is to use a
multinomial logit model for the probabilities $p_{ijk}$. Then the bivariate outcomes $(y_1, y_2)$
are essentially treated as $m_1 \times m_2$ univariate outcomes. For example, in the labor supply
and fertility example one of the four outcomes is then work and have children.

In the next section we consider models between these two extremes. 


15.10.2. Bivariate Probit 


The bivariate probit model is a joint model for two binary outcomes that generalizes 
the index function model (see Section 14.4.1) from one latent variable to two latent 
variables that may be correlated. 

Define the unobserved latent variables 


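$$\begin{aligned}
y_1^* &= x_1'\beta_1 + \varepsilon_1, \\
y_2^* &= x_2'\beta_2 + \varepsilon_2,
\end{aligned} \tag{15.51}$$

where $\varepsilon_1$ and $\varepsilon_2$ are joint normal with means zero, variances one, and correlation
$\rho$. Then the bivariate probit model specifies the observed outcomes to be

$$y_1 = \begin{cases} 2 & \text{if } y_1^* > 0, \\ 1 & \text{if } y_1^* \le 0, \end{cases}
\qquad
y_2 = \begin{cases} 2 & \text{if } y_2^* > 0, \\ 1 & \text{if } y_2^* \le 0, \end{cases}$$

where we use values (2, 1) rather than (1, 0) to be consistent with the notation of this
chapter. This model collapses to two separate probit models for $y_1$ and $y_2$ when the
error correlation $\rho = 0$.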
When $\rho \neq 0$ there is no closed-form solution for the probabilities. For example,

$$\begin{aligned}
p_{22} &= \Pr[y_1 = 2, y_2 = 2] \\
&= \Pr[y_1^* > 0, y_2^* > 0] \\
&= \Pr[-\varepsilon_1 < x_1'\beta_1, -\varepsilon_2 < x_2'\beta_2] \\
&= \Pr[\varepsilon_1 < x_1'\beta_1, \varepsilon_2 < x_2'\beta_2] \\
&= \int_{-\infty}^{x_2'\beta_2} \int_{-\infty}^{x_1'\beta_1} \phi(z_1, z_2, \rho)\, dz_1\, dz_2 \\
&= \Phi(x_1'\beta_1, x_2'\beta_2, \rho),
\end{aligned}$$

where $\phi(z_1, z_2, \rho)$ and $\Phi(z_1, z_2, \rho)$ are, respectively, the standardized bivariate normal
density and cdf for $(z_1, z_2)$ with zero means, unit variances, and correlation $\rho$, and the
fourth equality holds for the bivariate normal with mean zero.

Performing similar algebra for the other possible outcomes yields 


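$$p_{jk} = \Pr[y_1 = j, y_2 = k] = \Phi(q_1 x_1'\beta_1, q_2 x_2'\beta_2, q_1 q_2 \rho),$$

where $q_l = 1$ if $y_l = 2$ and $q_l = -1$ if $y_l = 1$ for $l = 1, 2$. The correlation argument
carries the sign $q_1 q_2$ because reversing the sign of one error reverses the sign of the
correlation. This is the basis for ML estimation, detailed in Greene (2003), who also
considers computation of marginal effects.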

Implementation requires evaluation of a bivariate normal integral, which is numer- 
ically feasible. Generalizations to multivariate probit are obvious though will experi- 
ence numerical challenges because of higher order integrals. If each outcome is or- 
dered then the model can be generalized to a bivariate ordered probit model. 
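
To illustrate that the bivariate normal integral is numerically feasible, the following Python sketch evaluates the four bivariate probit cell probabilities using a one-dimensional Simpson's rule. The functions `bvn_cdf` and `biprobit_probs` are hypothetical helpers written for this illustration, the lower integration limit of $-8$ is an assumed truncation point, and the correlation argument is multiplied by $q_1 q_2$ so that the four probabilities sum to one.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bvn_cdf(a, b, rho, n=2000):
    """Pr[Z1 <= a, Z2 <= b] for a standardized bivariate normal with |rho| < 1.

    Uses Pr = integral over z <= a of phi(z) * Phi((b - rho*z) / sqrt(1 - rho^2)),
    approximated by Simpson's rule on [-8, a] with n (even) subintervals.
    """
    lo = -8.0
    if a <= lo:
        return 0.0
    s = math.sqrt(1.0 - rho * rho)
    h = (a - lo) / n

    def g(z):
        return normal_pdf(z) * normal_cdf((b - rho * z) / s)

    total = g(lo) + g(a)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * g(lo + i * h)
    return total * h / 3.0

def biprobit_probs(xb1, xb2, rho):
    """Return {(j, k): Pr[y1 = j, y2 = k]} for outcomes coded 1 and 2."""
    probs = {}
    for j, q1 in ((1, -1), (2, 1)):
        for k, q2 in ((1, -1), (2, 1)):
            probs[(j, k)] = bvn_cdf(q1 * xb1, q2 * xb2, q1 * q2 * rho)
    return probs

cells = biprobit_probs(xb1=0.3, xb2=-0.2, rho=0.5)
```

In applied work a library routine for the bivariate normal cdf would normally replace the hand-written integration; the point here is only that each likelihood contribution requires one such evaluation.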

One can also consider a simultaneous equations probit model that generalizes 
(15.51) to allow the right-hand side variables to be endogenous. For example, the first 
equation for $y_1^*$ may include $y_2^*$ and/or $y_2$ as regressors and similarly for $y_2^*$, with some
restrictions required to ensure the model is identified. This model is similar to the 
simultaneous equations Tobit model discussed in Section 16.8.2. 


15.11. Semiparametric Estimation 


Some studies have extended semiparametric estimation methods to models for unordered
multinomial data. Abe (1999) estimated the conditional logit model with $x_{ij}'\beta$
in (15.10) replaced by the additive model form $\sum_p \beta_p f_p(x_{ijp})$, where $p$ denotes the
$p$th component of $x_{ij}$ and the function $f_p(\cdot)$ is estimated by the data. L-F. Lee (1995)
extended the Klein and Spady (1993) estimator (see Section 14.7) from binary out- 
comes to multinomial outcomes. Semiparametric methods for multiple-index models 
can also be applied to the multinomial unordered model. The challenge is to ensure 
that predicted probabilities lie between zero and one and sum to one. 

Ordered models lend themselves well to semiparametric analysis since they involve
an index $x'\beta$ that crosses a number of thresholds. See, for example, Klein and Sherman
(2002), who present an estimator that is $\sqrt{N}$-consistent and asymptotically normal for
both regression and threshold points up to location and scale, under the assumption
that errors are independent of regressors.


15.12. Derivations for MNL, CL, and NL Models 


We consider the conditional and multinomial logit models, deriving first and second 
derivatives of the log-likelihood function and expressions for the effect of changes in 
regressors on the probabilities. Then the nested logit (NL) model is derived from the 
GEV model. 


15.12.1. Conditional Logit 


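The conditional logit probability is $p_{ij} = e^{x_{ij}'\beta} / \sum_l e^{x_{il}'\beta}$. Differentiation by parts
yields

$$\begin{aligned}
\frac{\partial p_{ij}}{\partial \beta}
&= \frac{e^{x_{ij}'\beta}}{\sum_l e^{x_{il}'\beta}} x_{ij} - \frac{e^{x_{ij}'\beta}}{\left(\sum_l e^{x_{il}'\beta}\right)^2} \sum_l e^{x_{il}'\beta} x_{il} \\
&= p_{ij} x_{ij} - p_{ij} \sum_l p_{il} x_{il} = p_{ij} x_{ij} - p_{ij} \bar{x}_i = p_{ij}(x_{ij} - \bar{x}_i),
\end{aligned}$$

where $\bar{x}_i = \sum_l p_{il} x_{il}$. Then

$$\frac{\partial \mathcal{L}}{\partial \beta} = \sum_i \sum_j \frac{y_{ij}}{p_{ij}} \frac{\partial p_{ij}}{\partial \beta} = \sum_i \sum_j \frac{y_{ij}}{p_{ij}} p_{ij}(x_{ij} - \bar{x}_i) = \sum_i \sum_j y_{ij}(x_{ij} - \bar{x}_i).$$
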


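It follows that

$$\begin{aligned}
\frac{\partial^2 \mathcal{L}}{\partial \beta \partial \beta'}
&= -\sum_i \sum_j y_{ij} \frac{\partial \bar{x}_i}{\partial \beta'} \\
&= -\sum_i \sum_j y_{ij} \sum_l p_{il}(x_{il} - \bar{x}_i) x_{il}' \\
&= -\sum_i \sum_j p_{ij}(x_{ij} - \bar{x}_i) x_{ij}' \\
&= -\sum_i \sum_j p_{ij}(x_{ij} - \bar{x}_i)(x_{ij} - \bar{x}_i)',
\end{aligned}$$

which is (15.15). The second to last equality uses the fact that $y_{ij}$ equals one for exactly
one of the choices and zero otherwise, so that $\sum_j y_{ij} \sum_l a_{il} = \sum_j \sum_l y_{ij} a_{il} = \sum_l a_{il}$,
and the last equality uses $\sum_j p_{ij}(x_{ij} - \bar{x}_i)\bar{x}_i' = \sum_j (p_{ij} x_{ij} - p_{ij}\bar{x}_i)\bar{x}_i' =
(\bar{x}_i - \bar{x}_i)\bar{x}_i' = 0$ as $\sum_j p_{ij} = 1$.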
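Now consider the effect of changing regressors. For the conditional logit model

$$\frac{\partial p_{ij}}{\partial x_{ij}} = \frac{e^{x_{ij}'\beta}}{\sum_l e^{x_{il}'\beta}} \beta - \frac{e^{x_{ij}'\beta}}{\left(\sum_l e^{x_{il}'\beta}\right)^2} e^{x_{ij}'\beta} \beta = p_{ij}\beta - p_{ij}^2 \beta,$$

whereas for $j \neq k$

$$\frac{\partial p_{ij}}{\partial x_{ik}} = -\frac{e^{x_{ij}'\beta}}{\left(\sum_l e^{x_{il}'\beta}\right)^2} e^{x_{ik}'\beta} \beta = -p_{ij} p_{ik} \beta.$$
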


Combining these two results yields (15.18). 
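
These two cases can be written compactly as $\partial p_{ij}/\partial x_{ik} = p_{ij}(\mathbf{1}[j = k] - p_{ik})\beta$ and checked numerically. The Python sketch below compares the analytic expression with a central finite difference; the function names and the regressor and parameter values are illustrative assumptions.

```python
import math

def cl_probs(X, beta):
    """Conditional logit probabilities; X[j] is the regressor vector x_ij of alternative j."""
    v = [sum(x * b for x, b in zip(xj, beta)) for xj in X]
    vmax = max(v)                                   # subtract the maximum for numerical stability
    e = [math.exp(vj - vmax) for vj in v]
    s = sum(e)
    return [ej / s for ej in e]

def cl_marginal_effect(X, beta, j, k):
    """Analytic d p_ij / d x_ik = p_ij * (1[j == k] - p_ik) * beta."""
    p = cl_probs(X, beta)
    scale = p[j] * ((1.0 if j == k else 0.0) - p[k])
    return [scale * b for b in beta]

def cl_numeric_effect(X, beta, j, k, h=1e-6):
    """Central finite-difference approximation to the same derivative."""
    out = []
    for r in range(len(beta)):
        up = [list(row) for row in X]
        dn = [list(row) for row in X]
        up[k][r] += h
        dn[k][r] -= h
        out.append((cl_probs(up, beta)[j] - cl_probs(dn, beta)[j]) / (2.0 * h))
    return out

X = [[1.0, 0.5], [0.2, -0.3], [-0.4, 0.8]]   # three alternatives, two regressors (hypothetical)
beta = [0.7, -1.2]
own = cl_marginal_effect(X, beta, 0, 0)      # own effect
cross = cl_marginal_effect(X, beta, 0, 2)    # cross effect
```

The own effect has the sign of $\beta$ and the cross effect the opposite sign, as the formulas imply.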


15.12.2. Multinomial Logit 


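The multinomial logit probability is $p_{ij} = e^{x_i'\beta_j} / \sum_l e^{x_i'\beta_l}$. Differentiation by parts
yields

$$\frac{\partial p_{ij}}{\partial \beta_j} = \frac{e^{x_i'\beta_j}}{\sum_l e^{x_i'\beta_l}} x_i - \frac{e^{x_i'\beta_j}}{\left(\sum_l e^{x_i'\beta_l}\right)^2} e^{x_i'\beta_j} x_i = p_{ij} x_i - p_{ij} p_{ij} x_i,$$

whereas for $k \neq j$

$$\frac{\partial p_{ij}}{\partial \beta_k} = -\frac{e^{x_i'\beta_j}}{\left(\sum_l e^{x_i'\beta_l}\right)^2} e^{x_i'\beta_k} x_i = -p_{ij} p_{ik} x_i.$$
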

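Combining we have

$$\frac{\partial p_{ij}}{\partial \beta_k} = \delta_{ijk} p_{ij} x_i - p_{ij} p_{ik} x_i = p_{ij}(\delta_{ijk} - p_{ik}) x_i,$$

where the indicator variable $\delta_{ijk} = 1$ if $j = k$ and $\delta_{ijk} = 0$ otherwise, and

$$\begin{aligned}
\frac{\partial \mathcal{L}}{\partial \beta_k}
&= \sum_i \sum_j \frac{y_{ij}}{p_{ij}} \frac{\partial p_{ij}}{\partial \beta_k} \\
&= \sum_i \sum_j \frac{y_{ij}}{p_{ij}} (\delta_{ijk} p_{ij} - p_{ij} p_{ik}) x_i \\
&= \sum_i \Big[\sum_j y_{ij} \delta_{ijk} - \sum_j y_{ij} p_{ik}\Big] x_i \\
&= \sum_i [y_{ik} - p_{ik}] x_i,
\end{aligned}$$

as stated in (15.16), where the last line uses the definition of $\delta_{ijk}$ and $\sum_j y_{ij} = 1$. For
the second derivative we have

$$\frac{\partial^2 \mathcal{L}}{\partial \beta_j \partial \beta_k'} = -\sum_i x_i \frac{\partial p_{ij}}{\partial \beta_k'} = -\sum_i p_{ij}(\delta_{ijk} - p_{ik}) x_i x_i',$$

which yields (15.17).
When regressors change

$$\begin{aligned}
\frac{\partial p_{ij}}{\partial x_i}
&= \frac{e^{x_i'\beta_j}}{\sum_l e^{x_i'\beta_l}} \beta_j - \frac{e^{x_i'\beta_j}}{\left(\sum_l e^{x_i'\beta_l}\right)^2} \sum_l e^{x_i'\beta_l} \beta_l \\
&= p_{ij}\beta_j - p_{ij} \sum_l p_{il}\beta_l = p_{ij}(\beta_j - \bar{\beta}_i),
\end{aligned}$$

where $\bar{\beta}_i = \sum_l p_{il}\beta_l$, as stated in (15.19).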
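15.12.3. Nested Logit

We consider the two-level GEV model given by (15.32) and (15.33) with

$$G(Y) = G(Y_{11}, \ldots, Y_{1K_1}, \ldots, Y_{J1}, \ldots, Y_{JK_J}) = \sum_{j=1}^{J} a_j \left(\sum_{k=1}^{K_j} Y_{jk}^{1/\rho_j}\right)^{\rho_j},$$

which is a generalization of (15.34) owing to the coefficients $a_j$. The general GEV
result (15.31) becomes $\Pr[y_{jk} = 1] = Y_{jk} G_{jk} / G(Y)$, where $G_{jk}$ is the derivative of
$G(Y)$ with respect to $Y_{jk}$ and evaluation is at $Y_{jk} = e^{V_{jk}}$.

Now

$$G_{jk} = \frac{\partial G(Y)}{\partial Y_{jk}} = a_j \left(\sum_{l=1}^{K_j} Y_{jl}^{1/\rho_j}\right)^{\rho_j - 1} \times Y_{jk}^{1/\rho_j - 1},$$

which gives

$$p_{jk} = \frac{Y_{jk} G_{jk}}{G(Y)} = \frac{a_j \left(\sum_{l=1}^{K_j} Y_{jl}^{1/\rho_j}\right)^{\rho_j - 1} Y_{jk}^{1/\rho_j}}{\sum_{m=1}^{J} a_m \left(\sum_{l=1}^{K_m} Y_{ml}^{1/\rho_m}\right)^{\rho_m}}.$$

The probability of choosing limb $j$ is

$$p_j = \sum_{k=1}^{K_j} p_{jk} = \frac{a_j \left(\sum_{l=1}^{K_j} Y_{jl}^{1/\rho_j}\right)^{\rho_j}}{\sum_{m=1}^{J} a_m \left(\sum_{l=1}^{K_m} Y_{ml}^{1/\rho_m}\right)^{\rho_m}}$$

after some simplification, and the conditional probability of choosing branch $k$ given
limb $j$ is

$$p_{k|j} = \frac{p_{jk}}{p_j} = \frac{Y_{jk}^{1/\rho_j}}{\sum_{l=1}^{K_j} Y_{jl}^{1/\rho_j}}.$$

This result is also given in Maddala (1983, p. 72).
We need to evaluate these expressions at $Y_{jk} = \exp(V_{jk})$. Suppose

$$V_{jk} = z_j'\alpha + x_{jk}'\beta_j.$$

Then performing some algebra yields

$$\begin{aligned}
\left(e^{V_{jk}}\right)^{1/\rho_j} &= \exp(z_j'\alpha/\rho_j) \exp(x_{jk}'\beta_j/\rho_j), \\
\sum_{l=1}^{K_j} \left(e^{V_{jl}}\right)^{1/\rho_j} &= \exp(z_j'\alpha/\rho_j) \exp(I_j), \\
\left(\sum_{l=1}^{K_j} \left(e^{V_{jl}}\right)^{1/\rho_j}\right)^{\rho_j} &= \exp(z_j'\alpha + \rho_j I_j),
\end{aligned}$$

where

$$I_j = \ln\left(\sum_{l=1}^{K_j} \exp(x_{jl}'\beta_j/\rho_j)\right).$$
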
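It follows that the probability of choosing limb $j$ becomes

$$p_j = \frac{a_j \left(\sum_{l=1}^{K_j} \left(e^{V_{jl}}\right)^{1/\rho_j}\right)^{\rho_j}}{\sum_{m=1}^{J} a_m \left(\sum_{l=1}^{K_m} \left(e^{V_{ml}}\right)^{1/\rho_m}\right)^{\rho_m}}
= \frac{a_j \exp(z_j'\alpha + \rho_j I_j)}{\sum_{m=1}^{J} a_m \exp(z_m'\alpha + \rho_m I_m)},$$

as stated for the first term in (15.36). Note that the scalar $a_j$ can be absorbed into $z_j$ as
a limb-specific dummy, as $a_j \exp(z_j'\alpha + \rho_j I_j) = \exp(\ln a_j + z_j'\alpha + \rho_j I_j)$. Without
loss of generality we therefore set $a_j = 1$.

The probability of branch $k$ within limb $j$ is

$$p_{k|j} = \frac{\left(e^{V_{jk}}\right)^{1/\rho_j}}{\sum_{l=1}^{K_j} \left(e^{V_{jl}}\right)^{1/\rho_j}}
= \frac{\exp(z_j'\alpha/\rho_j) \exp(x_{jk}'\beta_j/\rho_j)}{\sum_{l=1}^{K_j} \exp(z_j'\alpha/\rho_j) \exp(x_{jl}'\beta_j/\rho_j)}
= \frac{\exp(x_{jk}'\beta_j/\rho_j)}{\sum_{l=1}^{K_j} \exp(x_{jl}'\beta_j/\rho_j)},$$

as stated for the second term in (15.36).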
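
The two components of the nested logit probability can be assembled as in the following Python sketch, which takes $a_j = 1$. The function name `nested_logit_probs` and the argument layout (limb indices $z_j'\alpha$ in `z_alpha`, branch indices $x_{jk}'\beta_j$ in `xb`, scale parameters $\rho_j$ in `rho`) are assumptions made for this illustration.

```python
import math

def nested_logit_probs(z_alpha, xb, rho):
    """Return p[j][k] = p_j * p_{k|j} for a two-level nested logit with a_j = 1.

    z_alpha[j] = z_j'alpha, xb[j][k] = x_jk'beta_j, rho[j] = scale parameter of limb j.
    """
    # Inclusive values I_j = ln sum_l exp(x_jl'beta_j / rho_j)
    incl = [math.log(sum(math.exp(v / r) for v in row)) for row, r in zip(xb, rho)]
    # Limb weights exp(z_j'alpha + rho_j * I_j)
    w = [math.exp(za + r * ij) for za, r, ij in zip(z_alpha, rho, incl)]
    w_sum = sum(w)
    probs = []
    for j, row in enumerate(xb):
        denom = sum(math.exp(v / rho[j]) for v in row)
        probs.append([(w[j] / w_sum) * math.exp(v / rho[j]) / denom for v in row])
    return probs

# Two limbs with two and three branches (hypothetical values).
z_alpha = [0.1, -0.2]
xb = [[0.5, -0.1], [0.3, 0.0, -0.4]]
p_nl = nested_logit_probs(z_alpha, xb, rho=[0.6, 0.8])
p_cl = nested_logit_probs(z_alpha, xb, rho=[1.0, 1.0])
```

With every $\rho_j = 1$ the expression collapses to a conditional logit with $V_{jk} = z_j'\alpha + x_{jk}'\beta_j$, which provides a convenient check on the code.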


15.13. Practical Considerations 


The multinomial logit model is adequate for describing data or estimating the marginal 
probabilities but is viewed as a poor model if a more structural interpretation of the pa- 
rameters is required, owing to the independence of irrelevant alternatives assumption. 
Many packages estimate the multinomial logit model. 

The nested logit model can be estimated in STATA and by using the NLOGIT add- 
on to LIMDEP, and it is easy to code in a language such as GAUSS. It is the obvious 
model to use if there is an obvious nesting structure, but usually there is no obvious 
structure. 

The random parameters logit model requires special code in a language such as 
GAUSS and requires use of the simulation-based estimation methods given in Chap- 
ter 12. Ken Train provides code at his Web site elsa.berkeley.edu/~train. 

The multinomial probit model is even more challenging to estimate, for more than 
four choices, and has met with relatively little empirical success. For these reasons the 
random parameters logit model is currently preferred. 
15.14. Bibliographic Notes


15.3 Good basic references for multinomial models include Amemiya (1981, 1985), Maddala 
(1983), and Greene (2003). The books by Ben-Akiva and Lerman (1985), Train (1986), 
and Borsch-Supan (1987) provide extensive applications as well as a review of theory. 
Train (2003) presents an outstanding treatment of unordered multinomial models and of
estimation using simulation methods.

15.5 The seminal article by McFadden (1981) provides an advanced treatment of discrete 
choice modeling, emphasizing the random utility model approach. For welfare analysis 
see Small and Rosen (1981), Train (2003, pp. 59-61) and Dagsvik and Karström (2004). 

15.6 Borsch-Supan (1987) gives an excellent exposition and application of the nested logit 
model. 


15.7 The random parameters logit model and other recent advances are well covered in Train 
(2003). Revelt and Train (1998) provide an early application. 
15.8 Bolduc (1999) presents MSL estimation of a nine-choice multinomial probit model. 
Exercises


Consider a latent variable modeled by $y^* = x'\beta + \varepsilon$, with $\varepsilon \sim N[0, 1]$. Suppose
we observe only $y = 2$ if $y^* \le \alpha$, $y = 1$ if $\alpha < y^* \le U$, and $y = 0$ if $y^* > U$, where
the upper limit $U$ is a known constant for each individual (i.e., data) and may
differ over individuals, but $\alpha$ is unknown.

(a) Obtain the conditional probabilities that $y = 0$, $y = 1$, and $y = 2$.

(b) Provide details on a method to consistently estimate $\beta$ and $\alpha$.

Use a 50% subsample of the fishing mode choice data of Section 15.2. 

(a) Estimate the conditional logit model of Section 15.2.1. 

(b) Comment on the statistical significance of parameter estimates. 

(c) What is the effect of an increase in price on the various modes of fishing?

Use a 50% subsample of the fishing mode choice data of Section 15.2.

(a) Estimate the multinomial logit model of Section 15.2.2. 


(b) Comment on the statistical significance of parameter estimates. 
(c) What is the effect of an increase in income on the various modes of fishing? 


Use a 50% subsample of the fishing mode choice data of Section 15.2. Suppose
we collapse the model to three alternatives and order the alternatives, with y = 0
if fishing from a pier or beach, y = 1 if fishing from a private boat, and y = 2 if
fishing from a charter boat.

(a) Estimate an ordered logit model with income as the only regressor. 

(b) Provide an interpretation of the estimated coefficient. 

(c) Compare the fit of this model with that from a three-choice multinomial 
model with income as the regressor. 
Tobit and Selection Models


16.1. Introduction 


In this chapter we consider two closely related topics: regression when the depen- 
dent variable of interest is incompletely observed and regression when the dependent 
variable is completely observed but is observed in a selected sample that is not rep- 
resentative of the population. This includes limited dependent variable models, latent 
variable models, generalized Tobit models, and selection models. 

All these models share the common feature that even in the simplest case of pop- 
ulation conditional mean linear in regressors, OLS regression leads to inconsistent 
parameter estimates because the sample is not representative of the population. Alter- 
native estimation procedures, most relying on strong distributional assumptions, are 
necessary to ensure consistent parameter estimation. 

Leading causes of incompletely observed data are truncation and censoring. For 
truncated data some observations on both the dependent variable and regressors are 
lost. For example, income may be the dependent variable and only low-income people 
are included in the sample. For censored data information on the dependent variable is 
lost, but not data on the regressors. For example, people of all income levels may be in- 
cluded in the sample, but for confidentiality reasons the income of high-income people 
may be top-coded and reported only as exceeding, say, $100,000 per year. Truncation 
entails greater information loss than does censoring. A leading example of truncation 
and censoring is the Tobit model, named after Tobin (1958), who considered linear 
regression under normality. Similar issues arise for truncation and censoring in other 
models introduced in later chapters, most notably for censored duration data presented 
in Chapter 17. More generally, truncation and censoring are examples of missing data 
problems that are studied in Chapter 27. 

The first-generation estimation methods require strong distributional assumptions. 
Even seemingly minor departures from assumptions, such as heteroskedastic errors 
when homoskedastic errors are assumed, can lead to inconsistent parameter estimates. 
For this reason the models presented in this chapter provide a leading econometrics 
application of semiparametric regression methods. Semiparametric methods for simple 
forms of censoring and truncation such as top-coding have been successfully applied.
However, for more general models with selection on unobservables there is to date no 
widely accepted procedure. 

Section 16.2 presents general theory for censored and truncated nonlinear regres- 
sion models, with specialization to the Tobit model given in Section 16.3. An alterna- 
tive model for censored data, the two-part model, is introduced in Section 16.4. The 
sample selection model is presented in Section 16.5. An application to health expen- 
ditures in Section 16.6 contrasts the two-part and sample selection models. The Roy 
model for unobserved counterfactuals is presented in Section 16.7. Section 16.8 con- 
siders fully structural models obtained by utility maximization with corner solutions 
or by extension of simultaneous equation models to selected samples. Semiparametric 
estimation is presented in Section 16.9. 


16.2. Censored and Truncated Models 


We present general methods for estimation of fully parametric models when data are 
censored or truncated. These methods can be applied to models presented in later 
chapters such as count and duration models. The leading example, the Tobit model 
for censoring or truncation in linear models, is introduced in Section 16.2.1 and given 
separate treatment in Section 16.3. 


16.2.1. Censoring and Truncation Example 


Let y* denote a variable that is incompletely observed. For truncation from below, y* 
is only observed if y* exceeds a threshold. For simplicity, let that threshold be zero. 
Then we observe y = y* if y* > 0. Since negative values do not appear in the sample, 
the truncated mean exceeds the mean of y*. For censoring from below at zero, y* is 
not completely observed when y* < 0, but it is known that y* < 0 and for simplicity 
y is then set to 0. Since negative values are scaled up to zero, the censored mean 
also exceeds the mean of y*. Clearly, sample means in truncated or censored samples 
cannot be used without adjustment to estimate the original population mean. 

This chapter studies similar issues for regression models. With luck, truncation and 
censoring might lead only to a shift up or down in the intercept, leaving slope coefficients
unchanged; however, this is not the case. For example, if $E[y^*|x] = x'\beta$ in the
original model then truncation or censoring leads to $E[y|x]$ being nonlinear in $x$ and
$\beta$ so that OLS gives inconsistent estimates of $\beta$ and hence inconsistent estimates of
marginal effects. 

As an illustration we consider the following labor supply example with simulated 
data. The relationship between desired annual hours worked, y*, and hourly wage, w, 
is specified to be of linear-log form with data-generation process 


$$\begin{aligned}
y^* &= -2500 + 1000 \ln w + \varepsilon, \\
\varepsilon &\sim N[0, 1000^2], \\
\ln w &\sim N[2.75, 0.60^2].
\end{aligned} \tag{16.1}$$
[Figure 16.1 plot. Title: Tobit: Censored and Truncated Means. Vertical axis: Different Conditional Means. Horizontal axis: Natural Logarithm of Wage. Legend: Actual Latent Variable, Truncated Mean, Censored Mean, Uncensored Mean.]


Figure 16.1: Tobit regression of hours on log wage: uncensored conditional mean 
(bottom), censored conditional mean (middle), and truncated conditional mean (top) for 
censoring/truncation from below at zero hours. Data are generated from a classical linear 
regression model. 


This is a Tobit model, studied in detail in Section 16.3. The model implies that the 
wage elasticity is 1000/y*, which equals, for example, 0.5 for full-time work (2,000 
hours). For each 1% increase in wage, annual hours increase by 10 hours. 

Figure 16.1 presents a scatter plot of $y^*$ and $\ln w$ for a generated sample of 200
observations. The uncensored conditional mean for $y^*$, which is $-2500 + 1000 \ln w$, is given
by the lowest curve, which is a straight line.

With censoring at zero, negative values of y* are set to zero because people with 
negative desired hours of work choose not to work. For this particular sample this 
is the case for about 35% of the observations. This pushes up the mean for low 
wages, since the many negative values of the y* are shifted up to zero. It has little 
impact for high wages, since then few observations on y* are zero. The middle curve 
in Figure 16.1 gives the resulting censored mean, using the formula given later in 
(16.23). 

With truncation at zero the 35% of the population with negative values of y* are 
dropped altogether. This increases the mean above the censored mean, since zero 
values are no longer included in the data used to form the mean. The upper curve 
in Figure 16.1 gives the resulting truncated mean, using the formula given later in 
(16.23). 

It is clear that censored and truncated conditional means are nonlinear in x even 
if the underlying population mean is linear. OLS estimation using truncated or cen- 
sored data will lead to inconsistent estimation of the slope parameter, since by vi- 
sual inspection of Figure 16.1 a linear approximation to the nonlinear truncated and 
censored means will have flatter slope than that for the original untruncated mean. 
Analysis should instead be based on the formulas for the censored or truncated condi- 
tional mean. Unfortunately these are based on strong distributional assumptions, as we 
will see. 
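
The ordering of the three means and the flattening of the OLS slope can be checked by simulation. The following Python sketch draws from the data-generation process (16.1); the seed, the sample size of 100,000 (much larger than the 200 observations plotted in Figure 16.1), and the helper `ols_slope` are illustrative assumptions.

```python
import random

def ols_slope(x, y):
    """Slope coefficient from an OLS regression of y on an intercept and x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

random.seed(12345)
N = 100_000
lnw, y_star = [], []
for _ in range(N):
    w = random.gauss(2.75, 0.60)                  # ln w ~ N[2.75, 0.60^2]
    lnw.append(w)
    y_star.append(-2500.0 + 1000.0 * w + random.gauss(0.0, 1000.0))

# Censoring from below at zero: negative desired hours are recorded as zero.
y_cens = [max(v, 0.0) for v in y_star]
# Truncation from below at zero: observations with y* <= 0 are dropped entirely.
kept = [(w, v) for w, v in zip(lnw, y_star) if v > 0.0]
lnw_trunc = [w for w, _ in kept]
y_trunc = [v for _, v in kept]

mean_latent = sum(y_star) / N
mean_cens = sum(y_cens) / N
mean_trunc = sum(y_trunc) / len(y_trunc)

slope_latent = ols_slope(lnw, y_star)             # should be close to 1000
slope_cens = ols_slope(lnw, y_cens)
slope_trunc = ols_slope(lnw_trunc, y_trunc)
```

In such a simulation the truncated mean exceeds the censored mean, which exceeds the latent mean, and both the censored and truncated OLS slopes fall short of the slope in the latent model.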
16.2.2. Censoring and Truncation Mechanisms


As is customary for regression analysis, we let y denote the observed value of the 
dependent variable. The departure from usual analysis is that y is the incompletely 
observed value of a latent dependent variable y*, where the observation rule is 

$$y = g(y^*),$$


for some specified function g(-). Leading examples of g(-) immediately follow. 


Censoring 


With censoring we always observe the regressors x, completely observe y* for a subset 
of the possible values of y*, and incompletely observe y for the remaining possible 
values of y*. If censoring is from below (or from the left), we observe 


$$y = \begin{cases} y^* & \text{if } y^* > L, \\ L & \text{if } y^* \le L. \end{cases} \tag{16.2}$$


For example, all consumers may be sampled with some having positive durable goods 
expenditures (y* > 0) and others having zero expenditures (y* < 0). If censoring is 
from above (or from the right) we observe 


$$y = \begin{cases} y^* & \text{if } y^* < U, \\ U & \text{if } y^* \ge U. \end{cases} \tag{16.3}$$


For example, annual income data may be top-coded at U = $100,000. This form of 
censoring is called type 1 censoring in the duration literature (see Section 17.4.1). 

The incompletely observed observations on y* are set to L or U for simplicity. 
More generally, we require that for incompletely observed observations y* is known 
to be missing (i.e., we observe that y* lies outside the relevant bound) and regressors 
x continue to be completely observed. 


Truncation 


Truncation entails additional information loss as all data on observations at the bound 
are lost. With truncation from below we observe only 


$$y = y^* \quad \text{if } y^* > L. \tag{16.4}$$


For example, only consumers who purchased durable goods may be sampled (L = 0). 
With truncation from above we observe only 


$$y = y^* \quad \text{if } y^* < U. \tag{16.5}$$


For example, only low-income individuals may be sampled. 


Interval Data 


Interval data are data recorded in intervals. Survey data are often collected in this 
way to aid recall and to provide some greater anonymity in responses to more personal 
questions. For example, income may be reported in intervals of $10,000 and then top-coded
at $100,000. Such data are censored at multiple points, with the observed data
y being the particular interval in which the unobserved y* lies. 


16.2.3. Censored and Truncated MLE 


Censoring and truncation are easily dealt with if the researcher applies a fully para- 
metric approach. This may be the case with interval data or top-coded data where, for 
example, it may be reasonable to assume a log-normal distribution for earnings or a 
negative binomial model for number of doctor visits. 

If the conditional distribution of y* given regressors x is specified, then the parame- 
ters of this distribution can be consistently and efficiently estimated by ML estimation 
based on the conditional distribution of the censored or truncated y. Specifically, let 
f*(y*|x) and F*(y*|x) denote the conditional probability density function (or prob- 
ability mass function) and cumulative distribution function of the latent variable y*. 
Then one can always obtain f(y|x) and F(y|x), the corresponding conditional pdf and 
cdf of the observed dependent variable y, since y = g(y*) is a transformation of y*. 

The limitation of the parametric approach is its reliance on strong distributional 
assumptions. For example, for the linear regression model under normality the MLE 
remains consistent even if the errors are nonnormal, but the censored MLE becomes 
inconsistent if the errors are nonnormal (see Section 16.3.2). More flexible models and 
semiparametric methods are presented in later sections. 


Censored MLE 


Censoring and truncation change both the conditional mean and the conditional den- 
sity. We begin with the density. 

Consider ML estimation given censoring from below. For y > L the density of y is 
the same as that for y*, so f(y|x) = f*(y|x). For y = L, the lower bound, the density 
is discrete with mass equal to the probability of observing y* < L, or F*(L|x). Thus 
for censoring from below 


$$f(y|x) = \begin{cases} f^*(y|x) & \text{if } y > L, \\ F^*(L|x) & \text{if } y = L. \end{cases}$$


As mentioned after (16.3), setting y = L when y* < L is not necessary. Even if no 
value of y is observed when y* < L the density is still F*(L|x). 

The density is a hybrid of the pdf and cdf of y*. Similar to analysis for binary 
outcome models, it is notationally convenient to introduce an indicator variable 


$$d = \begin{cases} 1 & \text{if } y > L, \\ 0 & \text{if } y = L. \end{cases} \tag{16.6}$$

Then the conditional density given censoring from below can be written as

$$f(y|x) = f^*(y|x)^d F^*(L|x)^{1-d}. \tag{16.7}$$
For a sample of $N$ independent observations, the censored MLE maximizes

$$\ln L_N(\theta) = \sum_{i=1}^{N} \{d_i \ln f^*(y_i|x_i, \theta) + (1 - d_i) \ln F^*(L_i|x_i, \theta)\}, \tag{16.8}$$

where $\theta$ are the parameters of the distribution of $y^*$. For generality the censoring lower
bound $L_i$ is permitted to vary across individuals, though usually $L_i = L$. The censored
MLE is consistent and asymptotically normal, provided the original density of the
uncensored variable $f^*(y^*|x, \theta)$ is correctly specified.

When censoring is instead from above, the log-likelihood is similar to (16.8),
except now $d = 1$ if $y < U$ and $d = 0$ otherwise, and $F^*(L|x, \theta)$ is replaced by
$1 - F^*(U|x, \theta)$. A leading example is right-censored duration data (see Section 17.4).
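
As an illustration of (16.8), the following Python sketch evaluates the censored log-likelihood when the latent variable is assumed to be normal with mean $\mu_i$ and standard deviation $\sigma$. The normality assumption, the function name `censored_normal_loglik`, and the data values are choices made for this example only; any other fully specified $f^*$ and $F^*$ could be substituted.

```python
import math

def normal_logpdf(y, mu, sigma):
    """ln f*(y) for y* ~ N[mu, sigma^2]."""
    z = (y - mu) / sigma
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * z * z

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def censored_normal_loglik(y, mu, sigma, L=0.0):
    """Censored log-likelihood (censoring from below at L).

    y[i] is the observed outcome (equal to L when censored),
    mu[i] is the latent conditional mean, e.g. x_i'beta.
    """
    total = 0.0
    for yi, mi in zip(y, mu):
        if yi > L:      # d_i = 1: contribute the latent density
            total += normal_logpdf(yi, mi, sigma)
        else:           # d_i = 0: contribute the probability mass F*(L | x_i)
            total += math.log(normal_cdf((L - mi) / sigma))
    return total

ll = censored_normal_loglik(y=[1.0, 0.0], mu=[0.0, 0.0], sigma=1.0)
```

Maximizing this function over the parameters that enter `mu` and over `sigma` would give the censored MLE under the stated normality assumption.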


Truncated MLE 


For truncation from below at L, and suppressing dependence on x, the conditional 
density of the observed y is 


$$\begin{aligned}
f(y) &= f^*(y | y > L) \\
&= f^*(y) / \Pr[y^* > L] \\
&= f^*(y) / [1 - F^*(L)].
\end{aligned}$$


The truncated MLE therefore maximizes

$$\ln L_N(\theta) = \sum_{i=1}^{N} \{\ln f^*(y_i|x_i, \theta) - \ln[1 - F^*(L_i|x_i, \theta)]\}. \tag{16.9}$$

If instead truncation is from above, the log-likelihood is (16.9), except that
$1 - F^*(L|x, \theta)$ is replaced by $F^*(U|x, \theta)$.

Ignoring censoring or truncation leads to inconsistency. For example, if truncation
is ignored the MLE maximizes $\sum_i \ln f^*(y_i|x_i, \theta)$, which is the wrong likelihood function
as it drops the second term in (16.9). Consistency of the censored and truncated
MLE requires correct specification of $f(\cdot)$, which in turn requires correct specification
of the latent variable density $f^*(\cdot)$. Even if $f^*(\cdot)$ is an LEF density (see Section
5.7.3), the density, and not just the mean, must be correctly specified if censoring or
truncation are present.


Interval Data MLE 


Suppose the latent variable y* is only observed to lie in one of the (J + 1) mutually
exclusive intervals $(-\infty, a_1], (a_1, a_2], \ldots, (a_J, \infty)$, where $a_1, a_2, \ldots, a_J$
are known. Then since

$$
\begin{aligned}
\Pr[a_j < y^* \le a_{j+1}] &= \Pr[y^* \le a_{j+1}] - \Pr[y^* \le a_j] \\
&= F^*(a_{j+1}) - F^*(a_j),
\end{aligned}
$$

the interval data MLE maximizes

$$
\ln L_N(\theta) = \sum_{i=1}^{N} \sum_{j=0}^{J} d_{ij} \ln\left[ F^*(a_{j+1}|x_i,\theta) - F^*(a_j|x_i,\theta) \right], \tag{16.10}
$$

where the $d_{ij}$, j = 0, ..., J, are binary indicators equal to one if $y_i^* \in (a_j, a_{j+1}]$ and
zero otherwise, with the conventions $a_0 = -\infty$ and $a_{J+1} = \infty$. This is similar to an
ordered probit or logit model (see Section 15.9.1), except here the interval boundaries
$a_1, \ldots, a_J$ are known.


16.2.4. Poisson Censored and Truncated MLE Example 


Assume that y* is Poisson distributed, so that $f^*(y) = e^{-\mu}\mu^{y}/y!$ and
$\ln f^*(y) = -\mu + y \ln \mu - \ln y!$, with mean $\mu = \exp(x'\beta)$.

Suppose the number of visits to a health clinic is modeled, but data are only available
for people who visited the health clinic. Then the data are truncated from below
at zero and we only observe y = y* if y* > 0. Then
$F^*(0) = \Pr[y^* \le 0] = \Pr[y^* = 0] = e^{-\mu}$, and from (16.9) the truncated MLE for
$\beta$ maximizes

$$
\ln L_N(\beta) = \sum_{i=1}^{N} \left\{ -\exp(x_i'\beta) + y_i x_i'\beta - \ln y_i! - \ln[1 - \exp(-\exp(x_i'\beta))] \right\}.
$$


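Suppose instead that data are censored from above at 10 because of top-coding, so
that we observe y = y* if y* < 10 and y = 10 if y* ≥ 10. Then
$\Pr[y^* \ge 10] = 1 - \Pr[y^* < 10] = 1 - \sum_{k=0}^{9} f^*(k)$. From (16.8) the censored MLE
for $\beta$ maximizes

$$
\ln L_N(\beta) = \sum_{i=1}^{N} \left\{ d_i \left[ -\exp(x_i'\beta) + y_i x_i'\beta - \ln y_i! \right] + (1 - d_i) \ln\left[ 1 - \sum_{k=0}^{9} e^{-\mu_i} \mu_i^{k}/k! \right] \right\},
$$

where $\mu_i = \exp(x_i'\beta)$.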

In both cases the resulting first-order conditions are considerably more complicated 
than those for the Poisson MLE without truncation or censoring. Also, in both cases 
ignoring the truncation or censoring and maximizing the original density leads to in- 
consistent parameter estimates. 
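
The two Poisson log-likelihoods above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the list-of-lists layout for the regressors `X`, and the `cap` argument are hypothetical choices made here, not part of the text, and no optimizer is included.

```python
import math

def poisson_logpmf(y, mu):
    """ln f*(y) = -mu + y ln(mu) - ln(y!) for the latent Poisson."""
    return -mu + y * math.log(mu) - math.lgamma(y + 1)

def truncated_loglik(beta, X, y):
    """Zero-truncated Poisson log-likelihood (every y_i must be >= 1)."""
    total = 0.0
    for xi, yi in zip(X, y):
        mu = math.exp(sum(b * x for b, x in zip(beta, xi)))
        # Subtract ln[1 - F*(0)] = ln[1 - exp(-mu)], the second term in (16.9).
        total += poisson_logpmf(yi, mu) - math.log1p(-math.exp(-mu))
    return total

def censored_loglik(beta, X, y, cap=10):
    """Poisson log-likelihood with top-coding at `cap` (y_i = cap means y* >= cap)."""
    total = 0.0
    for xi, yi in zip(X, y):
        mu = math.exp(sum(b * x for b, x in zip(beta, xi)))
        if yi < cap:
            # d_i = 1: the latent density applies.
            total += poisson_logpmf(yi, mu)
        else:
            # d_i = 0: Pr[y* >= cap] = 1 - sum_{k < cap} f*(k).
            below = sum(math.exp(poisson_logpmf(k, mu)) for k in range(cap))
            total += math.log(1.0 - below)
    return total
```

Either function could be handed to a numerical optimizer (with the sign flipped) to obtain the corresponding MLE. A useful sanity check is that each implied density sums to one over its support.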


16.2.5. Censored and Truncated Conditional Means 


Censoring and truncation change the conditional mean. 

For example, consider the Poisson truncated from below at zero. The truncated density
is $f^*(y)/[1 - F^*(0)]$, y = 1, 2, ..., so the truncated mean is
$\sum_{k=1}^{\infty} k f^*(k)/[1 - F^*(0)] = \sum_{k=0}^{\infty} k f^*(k)/[1 - F^*(0)] = \mu/(1 - e^{-\mu})$. Thus

$$
E[y|x] = \exp(x'\beta)/[1 - \exp(-\exp(x'\beta))],
$$

rather than $\exp(x'\beta)$ if there were no truncation.

This expression for E[y|x] can be used for NLS estimation. There is little advantage 
to NLS rather than ML estimation, however, as given truncation the NLS estimator 
relies on distributional assumptions that are essentially as strong as those needed for 
consistency of the more efficient ML estimator. 
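
As a quick numerical check of the zero-truncated Poisson mean, the following sketch compares a direct summation with the closed form μ/(1 − exp(−μ)). The function names and the summation cutoff `terms` are assumptions made for illustration.

```python
import math

def truncated_poisson_mean(mu, terms=500):
    """Sum k f*(k) / [1 - F*(0)] over k = 1, 2, ..., terms - 1."""
    norm = 1.0 - math.exp(-mu)
    total = 0.0
    for k in range(1, terms):
        # f*(k) evaluated on the log scale to avoid overflow in k!.
        total += k * math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))
    return total / norm

def closed_form_mean(mu):
    """Truncated mean mu / (1 - exp(-mu))."""
    return mu / (1.0 - math.exp(-mu))
```

The two functions should agree to numerical precision for moderate μ, and both exceed μ, as expected with truncation from below.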
16.3. Tobit Model


Truncation and censoring arise most often in econometrics in the linear regression 
model with normally distributed error, when only positive outcomes are completely 
observed. This model is called the Tobit model after Tobin (1958), who applied it to 
individual expenditures on consumer durable goods. The model in practice is usually 
too restrictive. It is nonetheless presented in some detail, as it provides the basis for 
more general models presented in subsequent sections of this chapter. 


16.3.1. Tobit Model 


The censored normal regression model, or Tobit model, is one with censoring from 
below at zero where the latent variable is linear in regressors with additive error that is 
normally distributed and homoskedastic. Thus 


$$
y^* = x'\beta + \varepsilon, \tag{16.11}
$$

where the error term

$$
\varepsilon \sim N[0, \sigma^2] \tag{16.12}
$$

has variance $\sigma^2$ constant across observations. This implies that the latent variable
$y^* \sim N[x'\beta, \sigma^2]$. The observed y is defined by (16.2) with L = 0, so

$$
y = \begin{cases} y^* & \text{if } y^* > 0, \\ - & \text{if } y^* \le 0, \end{cases} \tag{16.13}
$$

where − means that y is observed to be missing. No particular value of y is necessarily
observed when y* ≤ 0, though in some settings such as durable goods expenditures
we observe y = 0.

Equations (16.11) — (16.13) define the prototypical Tobit model analyzed by To- 
bin (1958). More generally, Tobit models begin with (16.11) and (16.12) for the latent 
variable but can have other censoring mechanisms including censoring from above, 
censoring from both below and above (the two-limit Tobit model), and interval- 
censored data. The results in this section are restricted to the censoring mechanism 
given in (16.13). The models of later sections are sometimes called generalized Tobit 
models. 

The normalization L = 0 is not only natural in many settings, but some such normalization
is necessary for a linear model with intercept and constant threshold parameter
L. Then we observe y if y* > L, or equivalently if $\beta_1 + x_2'\beta_2 + \varepsilon > L$ or
$(\beta_1 - L) + x_2'\beta_2 + \varepsilon > 0$. Thus only the difference $(\beta_1 - L)$ is identified. More
generally, the latent model $y^* = x'\beta + \varepsilon$ with variable censoring threshold $L = x'\gamma$ is
observationally equivalent to the latent model $y^* = x'(\beta - \gamma) + \varepsilon$ with fixed threshold
L = 0. These results are a consequence of censoring arising in a linear model with
additive error and do not carry over to nonlinear models, such as the preceding Poisson
example.
Applying the general expression (16.7) for the censored density, here f*(y) is the
$N[x'\beta, \sigma^2]$ density and

$$
\begin{aligned}
F^*(0) &= \Pr[y^* \le 0] \\
&= \Pr[x'\beta + \varepsilon \le 0] \\
&= \Phi(-x'\beta/\sigma) \\
&= 1 - \Phi(x'\beta/\sigma),
\end{aligned}
$$

where $\Phi(\cdot)$ is the standard normal cdf and the last equality uses symmetry of the
standard normal distribution. Thus the censored density can be expressed as

$$
f(y) = \left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2}(y - x'\beta)^2 \right\} \right]^{d} \left[ 1 - \Phi\left( \frac{x'\beta}{\sigma} \right) \right]^{1-d}, \tag{16.14}
$$

where the binary indicator d is defined in (16.6) with L = 0.
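The Tobit MLE $\hat\theta = (\hat\beta', \hat\sigma^2)'$ maximizes the censored log-likelihood function
(16.8). Given (16.14) this becomes

$$
\ln L_N(\beta,\sigma^2) = \sum_{i=1}^{N} \left\{ d_i \left( -\frac{1}{2}\ln\sigma^2 - \frac{1}{2}\ln 2\pi - \frac{1}{2\sigma^2}(y_i - x_i'\beta)^2 \right) + (1-d_i) \ln\left( 1 - \Phi\left( \frac{x_i'\beta}{\sigma} \right) \right) \right\}, \tag{16.15}
$$

a mixture of discrete and continuous densities. The first-order conditions are

$$
\begin{aligned}
\frac{\partial \ln L_N}{\partial \beta} &= \sum_{i=1}^{N} \frac{1}{\sigma^2} \left( d_i (y_i - x_i'\beta) - (1-d_i) \frac{\sigma\phi_i}{1-\Phi_i} \right) x_i = 0, \\
\frac{\partial \ln L_N}{\partial \sigma^2} &= \sum_{i=1}^{N} \left\{ d_i \left( -\frac{1}{2\sigma^2} + \frac{(y_i - x_i'\beta)^2}{2\sigma^4} \right) + (1-d_i) \frac{\phi_i \, x_i'\beta}{1-\Phi_i} \frac{1}{2\sigma^3} \right\} = 0,
\end{aligned} \tag{16.16}
$$

using $\partial\Phi(z)/\partial z = \phi(z)$, where $\phi(\cdot)$ is the standard normal pdf, and with the definitions
$\phi_i = \phi(x_i'\beta/\sigma)$ and $\Phi_i = \Phi(x_i'\beta/\sigma)$. As usual $\hat\theta$ is consistent if the density is correctly
specified, that is, if the dgp is (16.11) and (16.12) and the censoring mechanism is
(16.13). The MLE is asymptotically normally distributed with variance matrix given in, for
example, Maddala (1983, p. 155) and Amemiya (1985, p. 373).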

Tobin (1958) proposed ML estimation of the Tobit model and asserted that the usual 
ML theory applied. Amemiya (1973) provided a formal proof that the usual theory 
did apply, despite the mixed discrete—continuous nature of the censored density. The 
appendix of this classic paper of Amemiya details the asymptotic theory for extremum 
estimators presented in Section 5.3. 
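
To make the mixed discrete and continuous structure of (16.15) concrete, here is a minimal Python sketch of the censored log-likelihood. It assumes censoring at zero is recorded as y = 0; the function name and the plain-list data layout are hypothetical, and maximization is left to any general-purpose optimizer.

```python
import math
from statistics import NormalDist

_STD = NormalDist()  # standard normal, supplies Phi via .cdf

def tobit_loglik(beta, sigma2, X, y):
    """Censored-at-zero normal log-likelihood, following (16.15)."""
    sigma = math.sqrt(sigma2)
    total = 0.0
    for xi, yi in zip(X, y):
        xb = sum(b * x for b, x in zip(beta, xi))
        if yi > 0:
            # d_i = 1: log of the N[x'b, sigma^2] density at y_i.
            total += (-0.5 * math.log(sigma2) - 0.5 * math.log(2 * math.pi)
                      - (yi - xb) ** 2 / (2 * sigma2))
        else:
            # d_i = 0: ln Pr[y* <= 0] = ln[1 - Phi(x'b/sigma)] = ln Phi(-x'b/sigma).
            total += math.log(_STD.cdf(-xb / sigma))
    return total
```

Using Φ(−x'β/σ) rather than 1 − Φ(x'β/σ) for the censored observations is a small numerical design choice: it avoids cancellation when x'β/σ is large.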




If data are truncated, rather than censored, from below at zero then the Tobit MLE
$\hat\theta = (\hat\beta', \hat\sigma^2)'$ maximizes the truncated normal log-likelihood function

$$
\ln L_N(\beta,\sigma^2) = \sum_{i=1}^{N} \left\{ -\frac{1}{2}\ln\sigma^2 - \frac{1}{2}\ln 2\pi - \frac{1}{2\sigma^2}(y_i - x_i'\beta)^2 - \ln\Phi(x_i'\beta/\sigma) \right\}, \tag{16.17}
$$

obtained using (16.9) for y* distributed as in (16.11) and (16.12).


16.3.2. Inconsistency of the Tobit MLE 


A major weakness of the Tobit MLE is its heavy reliance on distributional assumptions.
If the error $\varepsilon$ is either heteroskedastic or nonnormal the MLE is inconsistent.

This can be seen from the ML first-order conditions (16.16), which are a quite
complicated function of variables including $d_i$, $y_i$, $\phi_i$, and $\Phi_i$. The first equation in
(16.16) satisfies $E[\partial \ln L_N/\partial\beta] = 0$, a necessary condition for consistency (see
Section 5.3.7), if

$$
\begin{aligned}
E[d_i] &= \Phi_i, \\
E[d_i y_i] &= \Phi_i x_i'\beta + \sigma\phi_i.
\end{aligned}
$$

These moment conditions can be shown to hold if the dgp is (16.11) and (16.12) and
the censoring mechanism is (16.13). However, they are unlikely to hold under any
other specification of the dgp, as they rely heavily on both normality and homoskedasticity.
For example, with heteroskedastic errors the estimator is inconsistent, since then
$E[d_i] = \Phi(x_i'\beta/\sigma_i) \ne \Phi_i$ unless $\sigma_i^2 = \sigma^2$.

Consistent estimation with heteroskedastic normal errors is possible by specifying
a model for heteroskedasticity, say $\sigma_i^2 = \exp(z_i'\gamma)$. For censoring from below at zero
the log-likelihood $\ln L_N(\beta, \gamma)$ is that given in (16.15) with $\sigma^2$ replaced by $\exp(z_i'\gamma)$.
Consistency then requires normal errors and correct specification of the functional
form of the heteroskedasticity.

Clearly, with censoring or truncation, distributional assumptions become important 
even for distributions somewhat robust to misspecification in the uncensored or un- 
truncated case. Specification tests for the Tobit model are discussed in Section 16.3.7. 
In many censored data applications the Tobit model is not appropriate. More general 
models presented in subsequent sections of this chapter are instead used. 


16.3.3. Censored and Truncated Means in Linear Regression 


Censoring and truncation in the linear regression model (16.11) lead to an observed
dependent variable y that has a distribution with conditional mean other than $x'\beta$,
conditional variance other than $\sigma^2$ even if $\varepsilon$ is homoskedastic, and a distribution that is
nonnormal even if $\varepsilon$ is normally distributed. We present general results for linear regression
in this section before specializing to normally distributed errors in Sections 16.3.4-16.3.7.
The results provide additional insights regarding the consequences of truncation
and censoring and form the basis for non-ML estimation methods presented in
later sections.

We begin with the truncated mean. The effects of truncation are intuitively pre- 
dictable. Left-truncation excludes small values, so the mean should increase, whereas 
with right-truncation the mean should decrease. Since truncation reduces the range of 
variation, the variance should decrease. 

For left-truncation at zero we only observe y if y* > 0. If we suppress dependence 
of expectations on x for notational simplicity, the left-truncated mean becomes 


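$$
\begin{aligned}
E[y] &= E[y^*|y^* > 0] \\
&= E[x'\beta + \varepsilon \mid x'\beta + \varepsilon > 0] \\
&= E[x'\beta \mid x'\beta + \varepsilon > 0] + E[\varepsilon \mid x'\beta + \varepsilon > 0] \\
&= x'\beta + E[\varepsilon \mid \varepsilon > -x'\beta],
\end{aligned} \tag{16.18}
$$

where the second equality uses (16.11), and the last equality assumes $\varepsilon$ is independent
of x. As expected the truncated mean exceeds $x'\beta$, since $E[\varepsilon|\varepsilon > c]$ for any constant c
will exceed $E[\varepsilon]$.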

For data left-censored at zero suppose we observe y = 0, rather than merely that 
y* < 0. The censored mean is obtained by first conditioning the observable y on the 
binary indicator d defined in (16.6) with L = 0 and then unconditioning. Suppressing 
dependence on x for notational simplicity again, we have the left-censored mean 


$$
\begin{aligned}
E[y] &= E_d[E_{y|d}[y|d]] \\
&= \Pr[d = 0] \times E[y|d = 0] + \Pr[d = 1] \times E[y|d = 1] \\
&= 0 \times \Pr[y^* \le 0] + \Pr[y^* > 0] \times E[y^*|y^* > 0] \\
&= \Pr[y^* > 0] \times E[y^*|y^* > 0],
\end{aligned} \tag{16.19}
$$

where $\Pr[y^* > 0] = 1 - \Pr[y^* \le 0] = \Pr[\varepsilon > -x'\beta]$ is one minus the censoring
probability and $E[y^*|y^* > 0]$ is the truncated mean already derived in (16.18).

In summary, for the linear regression model with censoring or truncation from be- 
low at zero, the conditional means are given by 


$$
\begin{aligned}
\text{latent variable:} \quad & E[y^*|x] = x'\beta, \\
\text{left-truncated (at 0):} \quad & E[y|x, y > 0] = x'\beta + E[\varepsilon|\varepsilon > -x'\beta], \\
\text{left-censored (at 0):} \quad & E[y|x] = \Pr[\varepsilon > -x'\beta] \, \{ x'\beta + E[\varepsilon|\varepsilon > -x'\beta] \}.
\end{aligned} \tag{16.20}
$$

It is clear that even though the original conditional mean is linear, censoring or truncation
leads to conditional means that are nonlinear so that OLS estimates will be
inconsistent.

One possible approach to take is a parametric one of assuming a distribution for $\varepsilon$.
This leads to expressions for $E[\varepsilon|\varepsilon > -x'\beta]$ and $\Pr[\varepsilon > -x'\beta]$ and hence the truncated
or censored conditional mean. We do this in the next section for normally distributed
errors.
Figure 16.2: Inverse Mills ratio for the standard normal distribution as the censoring or
cutoff point c increases. Standard normal cdf and density also plotted. (Panel title: Inverse
Mills Ratio as Cutoff Varies. Horizontal axis: cutoff point c. Vertical axis: inverse Mills
ratio, pdf, and cdf.)


A second approach seeks to avoid or minimize such parametric assumptions. We
consider this in a later section, but note here that regardless of the distribution for $\varepsilon$ the
truncated mean is a single-index model with correction term decreasing in $x'\beta$, since
$E[\varepsilon|\varepsilon > -x'\beta]$ is a monotonically decreasing function in $x'\beta$.


16.3.4. Censored and Truncated Means in the Tobit Model 


For the Tobit model the regression error $\varepsilon$ is normal and we use the following result,
derived in Section 16.10.1.


Proposition 16.1 (Truncated Moments of the Standard Normal): Suppose
z ~ N[0, 1]. Then the left-truncated moments of z are

(i) $E[z|z > c] = \phi(c)/[1 - \Phi(c)]$, and $E[z|z > -c] = \phi(c)/\Phi(c)$,
(ii) $E[z^2|z > c] = 1 + c\phi(c)/[1 - \Phi(c)]$, and
(iii) $V[z|z > c] = 1 + c\phi(c)/[1 - \Phi(c)] - \phi(c)^2/[1 - \Phi(c)]^2$.

Result (i) of Proposition 16.1 is shown in Figure 16.2. We consider truncation of
z ~ N[0, 1] from below at c, where c ranges from −2 to 2. The lowest curve is the
standard normal density $\phi(c)$ evaluated at c. The middle curve is the standard normal
cdf $\Phi(c)$ evaluated at c and gives the probability of truncation when truncation is at c.
This probability is approximately 0.023 at c = −2 and 0.977 at c = 2. The upper curve
gives the truncated mean $E[z|z > c] = \phi(c)/[1 - \Phi(c)]$. As expected this is close to
E[z] = 0 for c = −2, since then there is little truncation, and $E[z|z > c] > c$. What
is not expected a priori is that $\phi(c)/[1 - \Phi(c)]$ is approximately linear, especially for
c > 0. Moments when truncation is from above can be obtained using, for example,
$E[z|z < c] = -E[-z \mid -z > -c] = -\phi(c)/\Phi(c)$.
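
Proposition 16.1 can be checked numerically. The sketch below, under the assumption that a simple midpoint rule on [c, 10] is accurate enough for the standard normal, compares the closed-form truncated mean and variance with numerically integrated moments. All function names and the integration settings are illustrative choices.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal pdf and cdf

def lam_upper(c):
    """E[z | z > c] = phi(c) / [1 - Phi(c)], Proposition 16.1(i)."""
    return STD.pdf(c) / (1.0 - STD.cdf(c))

def truncated_variance(c):
    """V[z | z > c] from Proposition 16.1(iii)."""
    lam = lam_upper(c)
    return 1.0 + c * lam - lam ** 2

def truncated_moments_numeric(c, upper=10.0, n=100_000):
    """Midpoint-rule mean and variance of z ~ N[0, 1] given z > c."""
    h = (upper - c) / n
    m0 = m1 = m2 = 0.0
    for i in range(n):
        z = c + (i + 0.5) * h
        w = STD.pdf(z) * h  # probability mass of the small interval
        m0 += w
        m1 += z * w
        m2 += z * z * w
    mean = m1 / m0
    return mean, m2 / m0 - mean ** 2
```

For cutoffs in the range plotted in Figure 16.2 the numerical and closed-form moments should agree closely, and the truncated variance is below one.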
Applying this result to (16.18), the error term has truncated mean

$$
\begin{aligned}
E[\varepsilon \mid \varepsilon > -x'\beta] &= \sigma E\left[ \frac{\varepsilon}{\sigma} \,\Big|\, \frac{\varepsilon}{\sigma} > \frac{-x'\beta}{\sigma} \right] \\
&= \sigma\phi(-x'\beta/\sigma)/[1 - \Phi(-x'\beta/\sigma)] \\
&= \sigma\phi(x'\beta/\sigma)/\Phi(x'\beta/\sigma) \\
&= \sigma\lambda(x'\beta/\sigma),
\end{aligned} \tag{16.21}
$$

where the second line uses Proposition 16.1, the third line uses symmetry about zero
of $\phi(z)$, and we define

$$
\lambda(z) = \frac{\phi(z)}{\Phi(z)}. \tag{16.22}
$$

We follow the definition and terminology of Amemiya (1985) and many others in
defining $\lambda(\cdot)$ as in (16.22) and calling it the inverse Mills ratio. From Johnson and
Kotz (1970, p. 278), Mills actually tabulated the ratio $(1 - \Phi(z))/\phi(z)$, whose inverse
$\phi(z)/[1 - \Phi(z)] = \phi(z)/\Phi(-z)$ is the hazard function of the normal distribution.
Some authors therefore instead write (16.21) as $E[\varepsilon|\varepsilon > -x'\beta] = \sigma\lambda^*(-x'\beta/\sigma)$,
where $\lambda^*(z) = \phi(z)/\Phi(-z)$ is referred to as the inverse Mills ratio.

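Also, $\Pr[\varepsilon > -x'\beta] = \Pr[-\varepsilon < x'\beta] = \Pr[-\varepsilon/\sigma < x'\beta/\sigma] = \Phi(x'\beta/\sigma)$. Then
the conditional means in (16.20) specialize to

$$
\begin{aligned}
\text{latent variable:} \quad & E[y^*|x] = x'\beta, \\
\text{left-truncated (at 0):} \quad & E[y|x, y > 0] = x'\beta + \sigma\lambda(x'\beta/\sigma), \\
\text{left-censored (at 0):} \quad & E[y|x] = \Phi(x'\beta/\sigma)\,x'\beta + \sigma\phi(x'\beta/\sigma).
\end{aligned} \tag{16.23}
$$

The variance is similarly obtained (see Exercise 16.1). Defining $w = x'\beta/\sigma$, we have

$$
\begin{aligned}
\text{latent variable:} \quad & V[y^*|x] = \sigma^2, \\
\text{left-truncated (at 0):} \quad & V[y|x, y > 0] = \sigma^2 \left[ 1 - w\lambda(w) - \lambda(w)^2 \right], \\
\text{left-censored (at 0):} \quad & V[y|x] = \sigma^2 \Phi(w) \left\{ w^2 + w\lambda(w) + 1 - \Phi(w)[w + \lambda(w)]^2 \right\}.
\end{aligned} \tag{16.24}
$$

Clearly truncation and censoring induce heteroskedasticity, and for truncation
$V[y|x, y > 0] < \sigma^2$ so that truncation reduces variability, as expected.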
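
The Tobit conditional means and variances are easy to evaluate directly. The following sketch assumes the index x'β and σ are already known; the helper names `inv_mills`, `tobit_means`, and `tobit_variances` are hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()

def inv_mills(w):
    """lambda(w) = phi(w) / Phi(w), as in (16.22)."""
    return STD.pdf(w) / STD.cdf(w)

def tobit_means(xb, sigma):
    """Return (latent, left-truncated, left-censored) conditional means, as in (16.23)."""
    w = xb / sigma
    truncated = xb + sigma * inv_mills(w)
    censored = STD.cdf(w) * xb + sigma * STD.pdf(w)
    return xb, truncated, censored

def tobit_variances(xb, sigma):
    """Return (latent, left-truncated, left-censored) conditional variances, as in (16.24)."""
    w = xb / sigma
    lam = inv_mills(w)
    big_phi = STD.cdf(w)
    truncated = sigma ** 2 * (1.0 - w * lam - lam ** 2)
    censored = sigma ** 2 * big_phi * (w ** 2 + w * lam + 1.0 - big_phi * (w + lam) ** 2)
    return sigma ** 2, truncated, censored
```

Two identities give a cheap consistency check: the censored mean equals Φ(w) times the truncated mean, and the censored variance equals Φ(w) times the truncated second moment minus the squared censored mean.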

These results assume normal errors. Maddala (1983, p. 369) gives results similar 
to Proposition 16.1 for the log-normal, logistic, uniform, Laplace, exponential, and 
gamma distributions. 


16.3.5. Marginal Effects in the Tobit Model 


The marginal effect is the effect on the conditional mean of the dependent variable
of changes in the regressors. This effect varies according to whether interest lies in
the latent variable mean $x'\beta$ or the truncated or censored means given in (16.23).
Differentiating each with respect to x yields

$$
\begin{aligned}
\text{latent variable:} \quad & \partial E[y^*|x]/\partial x = \beta, \\
\text{left-truncated (at 0):} \quad & \partial E[y|x, y > 0]/\partial x = \{ 1 - w\lambda(w) - \lambda(w)^2 \}\beta, \\
\text{left-censored (at 0):} \quad & \partial E[y|x]/\partial x = \Phi(w)\beta,
\end{aligned} \tag{16.25}
$$

where $w = x'\beta/\sigma$ and we use $\partial\Phi(z)/\partial z = \phi(z)$ and $\partial\phi(z)/\partial z = -z\phi(z)$. The simple
expression for the censored mean is obtained after some manipulation. It can be
decomposed into two effects, one for y = 0 and one for y > 0 (see McDonald and
Moffitt, 1980).

In some cases truncation or censoring is just an artifact of data collection, so the
truncated and censored means are of no intrinsic interest and we are interested in
$\partial E[y^*|x]/\partial x = \beta$. For example, with top-coded earnings data we are clearly interested
in measuring the effect of schooling on mean earnings rather than earnings of
those not top-coded.

In other cases truncation or censoring has behavioral implications. In a model for
hours worked, for example, the three marginal effects in (16.25) correspond to the
effect of a change in a regressor on, respectively, (1) desired hours of work, (2) actual
hours of work for workers, and (3) actual hours of work for workers and nonworkers.
For (1) we clearly need an estimate of $\beta$, but for (2) and (3) OLS slope coefficients,
although inconsistent for $\beta$, may actually provide a reasonable crude estimate of the
marginal effect since the truncated and censored means are still fairly linear in x.
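
The analytical marginal effects in (16.25) can be verified against finite differences of the conditional means in (16.23). The sketch below is illustrative only: the function names, the list representation of x and β, and the step size `h` are assumptions.

```python
from statistics import NormalDist

STD = NormalDist()

def _index(x, beta):
    """Linear index x'b."""
    return sum(b * v for b, v in zip(beta, x))

def censored_mean(x, beta, sigma):
    """E[y|x] = Phi(w) x'b + sigma phi(w), with w = x'b / sigma."""
    xb = _index(x, beta)
    w = xb / sigma
    return STD.cdf(w) * xb + sigma * STD.pdf(w)

def truncated_mean(x, beta, sigma):
    """E[y|x, y > 0] = x'b + sigma lambda(w)."""
    xb = _index(x, beta)
    w = xb / sigma
    return xb + sigma * STD.pdf(w) / STD.cdf(w)

def marginal_effects(x, beta, sigma):
    """Analytical effects from (16.25): (latent, truncated, censored)."""
    w = _index(x, beta) / sigma
    lam = STD.pdf(w) / STD.cdf(w)
    latent = list(beta)
    truncated = [(1.0 - w * lam - lam ** 2) * b for b in beta]
    censored = [STD.cdf(w) * b for b in beta]
    return latent, truncated, censored

def numeric_gradient(fn, x, beta, sigma, h=1e-6):
    """Central finite-difference gradient of fn with respect to x."""
    grad = []
    for j in range(len(x)):
        up, down = list(x), list(x)
        up[j] += h
        down[j] -= h
        grad.append((fn(up, beta, sigma) - fn(down, beta, sigma)) / (2 * h))
    return grad
```

Both scaling factors lie between zero and one, so the truncated and censored effects are attenuated versions of β, with the attenuation varying with x.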


16.3.6. Alternative Estimators for the Tobit Model 


In addition to the MLE, consistent estimation is possible by NLS based on the correct 
expression for the truncated or censored mean. We consider the NLS estimator and 
other least-squares estimators. 


NLS Estimator 


The results in (16.23) can be used to permit consistent estimation of the Tobit model
parameters by NLS. For example, with truncated data we minimize

$$
S_N(\beta, \sigma^2) = \sum_{i=1}^{N} \left( y_i - x_i'\beta - \sigma\lambda(x_i'\beta/\sigma) \right)^2
$$

with respect to both $\beta$ and $\sigma^2$, but then perform inference controlling for the
heteroskedasticity given in (16.24). A similar estimator can be obtained for censored data.

This estimator is not used in practice. Consistency requires correct specification of
the truncated mean, which from (16.21) requires both normality and homoskedasticity
of the errors. One might as well estimate by ML since this relies on assumptions just as
strong and is fully efficient. Moreover, in practice the NLS estimator can be imprecise.
From Figure 16.2 it is clear that $\lambda(x'\beta/\sigma)$ is approximately linear in $x'\beta/\sigma$, leading
to near collinearity because x is also a regressor. In Section 16.5 we consider models
that permit correction terms similar to $\sigma\lambda(x'\beta/\sigma)$ in (16.23) that have the advantage
of depending in part on regressors other than those in x.
Heckman Two-Step Estimator


From (16.23) the truncated (at zero) mean is

$$
E[y|x, y > 0] = x'\beta + \sigma\lambda(x'\beta/\sigma). \tag{16.26}
$$

Rather than use NLS, this can be estimated in the following two-step procedure if
censored data are available. First, for the full sample do probit regression of d on x,
where the binary variable d equals one if y > 0 is observed, to give consistent estimate
$\hat\alpha$, where $\alpha = \beta/\sigma$. Second, for the truncated sample do OLS regression of y on x and
$\lambda(x'\hat\alpha)$ to give consistent estimates of $\beta$ and $\sigma$.

This estimation procedure, due to Heckman (1976, 1979), is presented in Section
16.5.4 where it is applied to the more general sample selection model. Section
16.10.2 derives the standard error of $\hat\beta$ that accounts for the regressor $\lambda(x'\hat\alpha)$ depending
on estimated parameters and for heteroskedasticity induced by truncation.


OLS Estimation of the Tobit Model 


The OLS estimates using censored or truncated data are inconsistent for $\beta$. This is because
the censored and truncated means given in (16.23) are not equal to $x'\beta$, violating
the essential condition for consistency of OLS.

For censored data, OLS provides a linear approximation to the nonlinear censored
regression curve. It is clear from Figure 16.1 and (16.25) that this line is flatter than the
regression line for uncensored data, which has slope equal to the true slope parameter.
Goldberger (1981) showed analytically that if y and x are jointly normally distributed
and there is censoring from below at zero, then the OLS slope parameters converge
to p times the true slope parameter, where p is the fraction of the sample with positive
values of y. These conditions are restrictive but were relaxed somewhat by Ruud
(1986). In practice this proportionality result provides a good empirical approximation
to the inconsistency of OLS if a Tobit model is instead appropriate.

Similarly, with truncation the regression line is flatter than the untruncated regression
line. Goldberger (1981) obtained an analytical result similar to that for the censored
case. If y and x are jointly normally distributed and there is truncation from below
at zero, then the OLS slope parameters converge to a multiple of the true slope parameter.
The multiple, the expression for which is quite lengthy, lies between zero and
one, and the shrinkage is the same for all slope coefficients. Truncated OLS therefore
understates the absolute magnitude of the true slope parameters.
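
The proportionality result for censored OLS can be illustrated with a small simulation. This is a sketch under assumed values (intercept 1, slope 1, standard normal regressor and error, 20,000 draws, a fixed seed), none of which come from the text; it uses only the Python standard library.

```python
import random
from statistics import fmean

def simulate_censored_ols(beta0=1.0, beta1=1.0, n=20_000, seed=12345):
    """Return (OLS slope of censored y on x, share of positive y)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        ystar = beta0 + beta1 * x + rng.gauss(0.0, 1.0)  # latent outcome
        xs.append(x)
        ys.append(max(0.0, ystar))  # left-censoring at zero
    xbar, ybar = fmean(xs), fmean(ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    share_positive = sum(y > 0 for y in ys) / n
    return sxy / sxx, share_positive
```

With jointly normal latent outcome and regressor, the returned slope should be close to the share of positive observations times the true slope, and hence below the true slope.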


16.3.7. Specification Tests for the Tobit Model 


Given the fragility of the Tobit model it is good practice to test for distributional mis- 
specification. There are four broad strategies. 

The first approach is to nest the Tobit model within a richer parametric model and
apply a Wald, LR, or LM test. Since the null hypothesis model, the Tobit model, is
most easily estimated it is natural to use LM tests. This is particularly straightforward
for testing against heteroskedasticity of the form $\sigma_i^2 = \exp(x_i'\alpha)$ in the censored
regression model. Using the OPG form of the LM test (see Section 7.3.5) we compute
N times the uncentered $R^2$ from auxiliary regression of 1 on $\tilde{s}_{1i}$ and $\tilde{s}_{2i}$, where
$f_i = f(y_i|x_i, \beta, \alpha)$ is the density given in (16.14) with $\sigma^2$ replaced by $\exp(x_i'\alpha)$, the
expressions for $s_{1i} = \partial \ln f_i/\partial\beta$ and $s_{2i} = \partial \ln f_i/\partial\alpha$ are obtained by minor adaptation
of the expressions in (16.16), and tilde denotes evaluation at the censored Tobit
MLE with all components of $\alpha$ except that for the intercept equal to zero. A similar
approach for testing the assumption of normally distributed errors is more difficult as
there is no standard generalization of the normal.

A second approach is to use conditional moment tests (see Section 8.2) that do not
require specification of an alternative hypothesis model. In particular, the first-order
conditions (16.16) for the censored Tobit MLE suggest conditional moment tests based
on the generalized residual

$$
e_i = d_i \frac{y_i - x_i'\beta}{\sigma^2} - (1 - d_i) \frac{\phi_i}{\sigma(1 - \Phi_i)}.
$$

If the Tobit model is correctly specified then $E[e_i|x_i] = 0$ since the regularity conditions
imply that $E[\partial \ln f(y_i)/\partial\beta] = 0$. Then we can implement an m-test of
$H_0: E[e_i z_i] = 0$ against $H_a: E[e_i z_i] \ne 0$ using $N^{-1}\sum_{i=1}^{N} \hat{e}_i z_i$, where $\hat{e}_i = e_i$ evaluated at
the Tobit MLE $(\hat\beta, \hat\sigma^2)$. From Section 8.2.2 this test can be implemented by computing
N times the uncentered $R^2$ from auxiliary regression of 1 on $\hat{e}_i z_i$, $\hat{s}_{1i}$, and $\hat{s}_{2i}$,
where $f_i = f(y_i|x_i, \beta, \sigma^2)$ is the density given in (16.14) and $s_{1i} = \partial \ln f_i/\partial\beta$ and
$s_{2i} = \partial \ln f_i/\partial\sigma^2$ given in (16.16) are evaluated at $(\hat\beta, \hat\sigma^2)$. The variables $z_i$ may be
variables other than $x_i$, in which case the test can be interpreted as a test of omitted
regressors, or powers of the components of $x_i$. Conditional moment tests based on higher
order moments have also been developed. For details see Chesher and Irish (1987) and
Pagan and Vella (1989).

A third approach is to adapt some of the diagnostic and testing methods developed 
for right-censored duration data (see Chapter 19) to left-censored normally distributed 
data. 

A final approach contrasts the Tobit MLE $\hat\beta$ with alternative estimates of $\beta$, notably
the semiparametric estimates presented in Section 16.9, that are consistent under
weaker distributional assumptions.

For further details see Pagan and Vella (1989), who present theory with some ap- 
plication, and Melenberg and Van Soest (1996), who provide a more complete appli- 
cation. Both papers consider specification tests for the richer sample selection model 
(see Section 16.5) in addition to those for the Tobit model. 


16.4. Two-Part Model 


The preceding models for censored data restrict the censoring mechanism to be from 
the same model as that generating the outcome variable. More generally, the censoring 
mechanism and outcome may be modeled using separate processes. For example, in 
explaining individual annual hospital expenses one process may determine hospitalization
and a second process may explain consequent hospital expenses. The case for
postulating two separate mechanisms is strong if there is compelling reason to believe
that certain realized values occur with a frequency that is too large or too small to be
consistent with a simpler model. For example, one might observe many more zeros than is
consistent with the Poisson distribution. A two-part model that permits
the zeros and non-zeros to be generated by different densities adds flexibility. Indeed
it is a specific type of mixture model.

There are two approaches to such generalization. The two-part model, given in this 
section, specifies a model for the censoring mechanism and a model for the outcome 
conditional on the outcome being observed. The sample selection model, presented in 
the subsequent section, instead specifies a joint distribution for the censoring mecha- 
nism and outcome, and then finds the implied distribution conditional on the outcome 
observed. These approaches are contrasted in Section 16.5.7. 


16.4.1. Two-Part Model 


Let an individual with fully observed outcome be called a participant in the activity 
being studied. Define a binary indicator variable d = 1 for participants and d = 0 for 
nonparticipants. Suppose that y > 0 is observed for participants and y = 0 is observed 
for nonparticipants. For nonparticipants we observe only Pr[d = 0]. For participants 
the conditional density of y given y > 0 is specified to be f(y|d = 1), for some choice
of density f(·). The two-part model for y is then given by

$$
f(y|x) = \begin{cases} \Pr[d = 0|x] & \text{if } y = 0, \\ \Pr[d = 1|x] \, f(y|d = 1, x) & \text{if } y > 0. \end{cases} \tag{16.27}
$$

This model was presented in detail by Cragg (1971) as a generalization of the Tobit 
model, which can be presented as a special case of (16.27). An obvious model for the 
participation decision d is a probit or logit model. A latent variable formulation is that 
d = 1 if $I = x'\beta + \varepsilon$ exceeds zero, and the model is then viewed as a hurdle model
since crossing a hurdle or threshold leads to participation. To ensure positive values for 
the participants, the density f(y|d = 1, x) should be that for a positive-valued random 
variable, such as the log-normal, or an appropriate density such as the normal truncated 
from below at zero. 

For simplicity the same regressors usually appear in both parts of the model, but 
this can be relaxed and should be if there are obvious exclusion restrictions. Maximum 
likelihood estimation is straightforward as it separates into estimation of a discrete 
choice model using all observations and estimation of the parameters of the density 
f(y|d = 1, x) using only observations with y > 0. 


16.4.2. Two-Part Model Examples 


Duan et al. (1983) present a leading application of this model to forecasting medical
expenses using data from the Rand Health Insurance Experiment. They specified
a probit model for whether or not any medical expenses were incurred during the
year, so $\Pr[d = 1|x] = \Phi(x_1'\beta_1)$, and a log-normal model for medical expenses given
that some expenses were incurred, so $\ln y \mid d = 1, x \sim N[x_2'\beta_2, \sigma_2^2]$. Then expected
medical expenses over the entire population are given by

$$
E[y|x] = \Phi(x_1'\beta_1) \exp[\sigma_2^2/2 + x_2'\beta_2], \tag{16.28}
$$

where the second term uses the result that if $\ln y \sim N[\mu, \sigma^2]$ then
$E[y] = \exp(\mu + \sigma^2/2)$. Mullahy (1998) considers such retransformation in further detail.
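
The two-part prediction in (16.28) combines a probit participation probability with a log-normal retransformation. A minimal sketch follows, assuming the two linear indexes and the log-scale error variance have already been estimated; the function name and argument names are hypothetical.

```python
import math
from statistics import NormalDist

def two_part_mean(x1b1, x2b2, sigma2_sq):
    """E[y|x] = Phi(x1'b1) * exp(sigma_2^2 / 2 + x2'b2), as in (16.28).

    x1b1      : probit index for participation
    x2b2      : index for ln(y) among participants
    sigma2_sq : variance of the error in the ln(y) equation
    """
    participation = NormalDist().cdf(x1b1)                 # Pr[d = 1 | x]
    level_given_positive = math.exp(sigma2_sq / 2.0 + x2b2)  # E[y | d = 1, x]
    return participation * level_given_positive
```

Note that the retransformation term exp(σ²/2) relies on log-normality of the positive expenses; dropping it would understate the predicted level.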

Two-part models are especially popular for modeling count data. For example, in 
modeling the number of doctor visits there is one model to determine whether or not 
a patient visits a physician at all and a second model to determine the consequent 
number of visits for those with at least one visit. Then Pr[d = 1] is specified to be 
the probability that a Poisson or negative binomial variable exceeds zero, whereas the 
density f(y|d = 1) is specified to be a Poisson or negative binomial density truncated 
from below at zero. This model, due to Mullahy (1986), is called a hurdle model in the 
count literature and is detailed in Section 20.4.5. 

For continuous data two-part models are used for expenditure models with excess 
zeros (Cragg’s original motivation). An alternative, a sample selection model, is pre- 
sented next. 


16.5. Sample Selection Models 


Sample selection can arise in many settings and so there are many sample selection
models. This section begins with a general discussion of sample selection before focus- 
ing on a leading example, the bivariate sample selection model studied by Heckman 
(1979). Another leading example, the Roy model, is treated separately in Section 16.7. 


16.5.1. Sample Selection Models 


Observational studies are rarely based on pure random samples. Most often exogenous 
sampling is used (see Section 3.2.4) and the usual estimators can be applied. If instead 
a sample, intentionally or unintentionally, is based in part on values taken by a depen- 
dent variable, parameter estimates may be inconsistent unless corrective measures are 
taken. Such samples can be broadly defined as selected samples. 

There are many selection models, since there are many ways that a selected sample 
may be generated. Indeed it is very easy to be unaware that a selected sample is being 
used. For example, consider interpretation of average scores over time on an achieve- 
ment test such as the Scholastic Aptitude Test, when test taking is voluntary. A decline 
over time may be due to real deterioration in student knowledge. However, it may just 
reflect the selection effect that relatively more students have been taking the test over 
time and the new test takers are the relatively weaker students. 

Selection may be due to self-selection, with the outcome of interest determined in 
part by individual choice of whether or not to participate in the activity of interest. 
It can also result from sample selection, with those who participate in the activity of 
interest deliberately oversampled — an extreme case being sampling only participants. 
In either case, similar issues arise and selection models are usually called sample se- 
lection models. 

This chapter presents only three of the many selection models in the literature. The
simplest model is the Tobit model already presented in Section 16.3. A prototypical 
commonly used model that we call the bivariate sample selection model is presented in 
the remainder of this section. This model generalizes the Tobit model by introducing 
a censoring latent variable that differs from the latent variable generating the outcome 
of interest. Another popular model called the Roy model is presented in Section 16.7. 
This model considers an outcome that takes one of two values depending on the value 
taken by a censoring random variable. These models correspond to, respectively, the 
Tobit model types 1, 2, and 5 in the terminology of Amemiya (1985, p. 384). 

Consistent estimation in the presence of sample selection on unobservables relies 
on relatively strong distributional assumptions, even in the case of semiparametric es- 
timation. Experimental data studies provide an attractive alternative as selection prob- 
lems can then be avoided by random assignment. However, experiments can be diffi- 
cult to implement in economics applications for cost and ethical reasons. The treatment 
effects approach, detailed in Chapter 25, seeks to apply the experimental approach to 
observational data. 


16.5.2. A Bivariate Sample Selection Model (Type 2 Tobit) 


Let $y_2^*$ denote the outcome of interest. In the standard truncated Tobit model this
outcome is observed if $y_2^* > 0$. A more general model introduces a different latent
variable, $y_1^*$, and the outcome $y_2^*$ is observed if $y_1^* > 0$. For example, $y_1^*$ determines
whether or not to work and $y_2^*$ determines how much to work, and $y_1^* \neq y_2^*$ since there
are fixed costs to work such as commuting costs that are more important in determining
participation than hours of work once working.

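The bivariate sample selection model comprises a participation equation that

$$
y_1 = \begin{cases} 1 & \text{if } y_1^* > 0, \\ 0 & \text{if } y_1^* \le 0, \end{cases} \tag{16.29}
$$

and a resultant outcome equation that

$$
y_2 = \begin{cases} y_2^* & \text{if } y_1^* > 0, \\ \text{--} & \text{if } y_1^* \le 0. \end{cases} \tag{16.30}
$$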


This model specifies that $y_2$ is observed when $y_1^* > 0$, whereas $y_2$ need not take on
any meaningful value when $y_1^* \le 0$. The standard model specifies a linear model with
additive errors for the latent variables, so

$$
\begin{aligned}
y_1^* &= x_1'\beta_1 + \varepsilon_1, \\
y_2^* &= x_2'\beta_2 + \varepsilon_2,
\end{aligned} \tag{16.31}
$$

with problems arising in estimating $\beta_2$ if $\varepsilon_1$ and $\varepsilon_2$ are correlated. The Tobit model is
clearly the special case where $y_1^* = y_2^*$.

There is no generally accepted name for this model. Heckman (1979) used it to 
illustrate estimation given sample selection. The model is equivalent to a Tobit model 
with stochastic threshold (Nelson, 1977). Suppose we observe $y_2^*$ if $y_2^* > L^*$, where
$y_2^*$ is defined as in (16.31) and the threshold is $L^* = z'\gamma + v$ rather than $L^* = 0$ in
Section 16.3. Then, equivalently, we observe $y_2^*$ if $y_1^* > 0$, where
$y_1^* = y_2^* - L^* = (x_2'\beta_2 - z'\gamma) + (\varepsilon_2 - v) = x_1'\beta_1 + \varepsilon_1$
and where $x_1$ denotes the union of $x_2$ and $z$, and
$\beta_1$ and $\varepsilon_1$ are defined in an obvious manner. Amemiya (1985, p. 384) calls the model
a type 2 Tobit model. Wooldridge (2002, p. 506) calls the model one with a probit 
selection equation. Others call this model the generalized Tobit model or the sample 
selection model, though there are many such models. 

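Estimation by ML is straightforward given the additional assumption that the
correlated errors are joint normally distributed and homoskedastic, with

$$
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix} \sim N\left[ \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix} \right]. \tag{16.32}
$$

As for the probit model in Section 14.4.1, the normalization $\sigma_1^2 = 1$ is used since only
the sign of $y_1^*$ is observed.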

Given (16.29) and (16.30), for $y_1^* > 0$ we observe $y_2^*$, with probability equal to
the probability that $y_1^* > 0$ times the conditional probability of $y_2^*$ given that $y_1^* > 0$.
Thus for positive $y_2$ the density of the observables is $f^*(y_2^* \mid y_1^* > 0) \times \Pr[y_1^* > 0]$.
For $y_1^* \le 0$ all that is observed is that this event has occurred, and the density is the
probability of this event occurring. The bivariate sample selection model therefore has
likelihood function

$$
L = \prod_{i=1}^{n} \left\{ \Pr[y_{1i}^* \le 0] \right\}^{1 - y_{1i}} \left\{ f(y_{2i} \mid y_{1i}^* > 0) \times \Pr[y_{1i}^* > 0] \right\}^{y_{1i}}, \tag{16.33}
$$

where the first term is the discrete contribution when $y_{1i}^* \le 0$, since then $y_{1i} = 0$, and
the second term is the continuous contribution when $y_{1i}^* > 0$. This likelihood function
is applicable to quite general models, not just linear models with joint normal errors.

Specializing to linear models with joint normal errors gives a bivariate density
$f^*(y_1^*, y_2^*)$ that is normal, leading to a conditional density in the second term that is
univariate normal and easily handled. Amemiya (1985, pp. 385-387) provides details,
including the exact form of the likelihood function.
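
As a minimal sketch of what the likelihood (16.33) looks like under joint normality, the function below evaluates its logarithm. It relies on the standard factorization $f(y_2 \mid y_1^* > 0)\Pr[y_1^* > 0] = f(y_2)\Pr[y_1^* > 0 \mid y_2]$ for the participant term. The function name `selection_loglik` and its argument layout are assumptions made for illustration, and the code takes the index values $x_1'\beta_1$ and $x_2'\beta_2$ as given rather than maximizing over parameters.

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def log_norm_pdf(z):
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def selection_loglik(y1, y2, xb1, xb2, sigma12, sigma2):
    """Log of (16.33) under the joint normal errors (16.32), sigma_1^2 = 1.

    y1  : list of 0/1 participation indicators
    y2  : list of outcomes (ignored where y1 == 0)
    xb1 : list of index values x1'beta1
    xb2 : list of index values x2'beta2
    """
    rho = sigma12 / sigma2
    ll = 0.0
    for d, y, a, b in zip(y1, y2, xb1, xb2):
        if d == 0:
            # Discrete contribution: Pr[y1* <= 0] = Phi(-x1'beta1).
            ll += math.log(norm_cdf(-a))
        else:
            e2 = (y - b) / sigma2                      # standardized outcome error
            # Marginal density of y2.
            ll += log_norm_pdf(e2) - math.log(sigma2)
            # Pr[y1* > 0 | y2], using eps1 | eps2 ~ N(rho * e2, 1 - rho^2).
            ll += math.log(norm_cdf((a + rho * e2) / math.sqrt(1.0 - rho * rho)))
    return ll

# Hypothetical three-observation example.
print(selection_loglik([1, 0, 1], [1.0, 0.0, -0.5],
                       [0.3, -0.2, 1.0], [0.5, 0.1, 0.0],
                       sigma12=0.8, sigma2=2.0))
```

With $\sigma_{12} = 0$ the expression separates into a probit log-likelihood plus a normal regression log-likelihood for participants. The sketch does not guard against numerical underflow in the tails.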

The classic early application of this model was to labor supply, where $y_1^*$ is the
unobserved desire or propensity to work, whereas $y_2$ is actual hours worked. The model
is also conceptually more appealing for labor supply than the Tobit model in Section 
14.2.1 which required the artifice of “desired” hours of work. This prototypical ap- 
plication does have the complication that data on a key regressor, the offered wage, 
is missing for those individuals who do not work. This complication is handled by 
adding an equation for the offered wage and substituting this in, though the model is 
then strictly speaking not just a bivariate sample selection model. See Mroz (1987) for 
an excellent application to labor supply. 


16.5.3. Conditional Means in the Bivariate Sample Selection Model 


In this section we obtain the conditional truncated mean in the bivariate sample
selection model. It differs from $x_2'\beta_2$, so that OLS regression of $y_2$ on $x_2$ leads to
inconsistent parameter estimates. Nonetheless, the expression for the conditional mean can be
used to motivate an alternative estimation procedure given in the subsequent section
that relies on weaker distributional assumptions than those of the MLE.

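We consider the truncated mean in the sample selectivity model where only positive
values of $y_2$ are used. In general this is

$$
\begin{aligned}
E[y_2 \mid x, y_1^* > 0] &= E[x_2'\beta_2 + \varepsilon_2 \mid x_1'\beta_1 + \varepsilon_1 > 0] \\
&= x_2'\beta_2 + E[\varepsilon_2 \mid \varepsilon_1 > -x_1'\beta_1],
\end{aligned} \tag{16.34}
$$

where $x$ denotes the union of $x_1$ and $x_2$. If the errors $\varepsilon_1$ and $\varepsilon_2$ are independent then
the last term simplifies to $E[\varepsilon_2] = 0$, and OLS regression of $y_2$ on $x_2$ will give a
consistent estimate of $\beta_2$. However, any correlation between the two errors means that the
truncated mean is no longer $x_2'\beta_2$ and we need to account for selection.

To obtain $E[\varepsilon_2 \mid \varepsilon_1 > -x_1'\beta_1]$ when $\varepsilon_1$ and $\varepsilon_2$ are correlated, Heckman (1979) noted
that if the errors $(\varepsilon_1, \varepsilon_2)$ in (16.31) are joint normal as in (16.32) then Equation (16.36)
in the following implies that

$$
\varepsilon_2 = \sigma_{12}\varepsilon_1 + \xi, \tag{16.35}
$$

where the random variable $\xi$ is independent of $\varepsilon_1$. To obtain this result, note that in
general the joint normal distribution

$$
\begin{bmatrix} z_1 \\ z_2 \end{bmatrix} \sim N\left[ \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right]
$$

implies the conditional normal distribution

$$
z_2 \mid z_1 \sim N\left[ \mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(z_1 - \mu_1),\ \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12} \right],
$$

a result that implies that

$$
z_2 = \mu_2 + \Sigma_{21}\Sigma_{11}^{-1}(z_1 - \mu_1) + \xi, \tag{16.36}
$$

where $\xi \sim N[0, \Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}]$ is independent of $z_1$. For the joint density given
in (16.32) we have scalars and $\mu_1 = \mu_2 = 0$ and $\sigma_1^2 = 1$, so (16.36) specializes to
(16.35).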
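
The specialization from (16.36) to (16.35) can be written out term by term. The following worked step only substitutes the scalar quantities of (16.32) into (16.36); the expression for the variance of $\xi$ in terms of the error correlation $\rho = \sigma_{12}/\sigma_2$ is an added remark.

```latex
\begin{aligned}
&\mu_1 = \mu_2 = 0, \quad \Sigma_{11} = \sigma_1^2 = 1, \quad
 \Sigma_{12} = \Sigma_{21} = \sigma_{12}, \quad \Sigma_{22} = \sigma_2^2 \\
&\Rightarrow \quad \varepsilon_2 = 0 + \sigma_{12} \cdot 1^{-1} \cdot (\varepsilon_1 - 0) + \xi
   = \sigma_{12}\varepsilon_1 + \xi, \\
&\phantom{\Rightarrow} \quad \xi \sim N\left[0,\ \sigma_2^2 - \sigma_{12}^2\right]
   = N\left[0,\ \sigma_2^2 (1 - \rho^2)\right], \qquad \rho = \sigma_{12}/\sigma_2 .
\end{aligned}
```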

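By using (16.35), the truncated mean (16.34) becomes

$$
\begin{aligned}
E[y_2 \mid x, y_1^* > 0] &= x_2'\beta_2 + E[(\sigma_{12}\varepsilon_1 + \xi) \mid \varepsilon_1 > -x_1'\beta_1] \\
&= x_2'\beta_2 + \sigma_{12} E[\varepsilon_1 \mid \varepsilon_1 > -x_1'\beta_1],
\end{aligned}
$$

where we use independence of $\xi$ and $\varepsilon_1$. The selection term is similar to that in the
simpler Tobit model and again using the expression for $E[z \mid z > -c]$ in Proposition
16.1 we obtain

$$
E[y_2 \mid x, y_1^* > 0] = x_2'\beta_2 + \sigma_{12}\lambda(x_1'\beta_1), \tag{16.37}
$$

where $\lambda(z) = \phi(z)/\Phi(z)$ and we have used $\sigma_1^2 = 1$. Similarly, Proposition 16.1(iii)
yields the truncated variance

$$
V[y_2 \mid x, y_1^* > 0] = \sigma_2^2 - \sigma_{12}^2 \lambda(x_1'\beta_1)\left(x_1'\beta_1 + \lambda(x_1'\beta_1)\right). \tag{16.38}
$$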
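
The selection terms in (16.37) and (16.38) can be checked by simulation. The sketch below draws errors according to (16.35), keeps those with $\varepsilon_1 > -c$ where $c$ stands for $x_1'\beta_1$, and compares the sample mean and variance of $\varepsilon_2$ with the formulas. The parameter values and the function name `simulate_truncated_moments` are hypothetical choices for illustration.

```python
import math
import random

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def inv_mills(z):
    return norm_pdf(z) / norm_cdf(z)

def simulate_truncated_moments(c, sigma12, sigma2, n=200_000, seed=0):
    """Sample mean and variance of eps2 given eps1 > -c."""
    rng = random.Random(seed)
    sd_xi = math.sqrt(sigma2 ** 2 - sigma12 ** 2)
    kept = []
    for _ in range(n):
        e1 = rng.gauss(0.0, 1.0)
        if e1 > -c:
            kept.append(sigma12 * e1 + rng.gauss(0.0, sd_xi))   # (16.35)
    m = sum(kept) / len(kept)
    v = sum((e - m) ** 2 for e in kept) / len(kept)
    return m, v

c, sigma12, sigma2 = 0.5, 0.8, 1.5          # hypothetical values
lam = inv_mills(c)
mean_sim, var_sim = simulate_truncated_moments(c, sigma12, sigma2)
mean_formula = sigma12 * lam                                   # selection term in (16.37)
var_formula = sigma2 ** 2 - sigma12 ** 2 * lam * (c + lam)     # (16.38)
print(f"mean: simulated {mean_sim:.4f}, formula {mean_formula:.4f}")
print(f"variance: simulated {var_sim:.4f}, formula {var_formula:.4f}")
```

The simulated and analytical moments should agree up to sampling noise, and the truncated variance is smaller than $\sigma_2^2$ whenever $\sigma_{12} \neq 0$.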


The preceding analysis specifies no value for $y_2$ when $y_1^* \le 0$. In some applications
$y_2$ may equal zero when $y_1^* \le 0$. Then it is meaningful to consider the censored mean.
Conditioning the observable $y_2$ on the unobservables $y_1^*$ and $y_2^*$ and then unconditioning
yields

$$
\begin{aligned}
E[y_2 \mid x] &= E_{y_1^*}\left[ E[y_2 \mid x, y_1^*] \right] \\
&= \Pr[y_1^* \le 0 \mid x] \times 0 + \Pr[y_1^* > 0 \mid x] \times E[y_2^* \mid x, y_1^* > 0] \\
&= 0 + \Phi(x_1'\beta_1)\left\{ x_2'\beta_2 + \sigma_{12}\lambda(x_1'\beta_1) \right\} \\
&= \Phi(x_1'\beta_1)\, x_2'\beta_2 + \sigma_{12}\phi(x_1'\beta_1),
\end{aligned} \tag{16.39}
$$

where the third line uses (16.37) and the last line uses $\lambda(z) = \phi(z)/\Phi(z)$. The censored
variance can be shown to be heteroskedastic.
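
To make the distinction between the three conditional means concrete, the short sketch below evaluates the latent mean, the truncated mean (16.37), and the censored mean (16.39) at given index values. The function names and the numerical inputs are hypothetical.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def inv_mills(z):
    return norm_pdf(z) / norm_cdf(z)

def latent_mean(xb2):
    """E[y2* | x] = x2'beta2."""
    return xb2

def truncated_mean(xb1, xb2, sigma12):
    """E[y2 | x, y1* > 0] from (16.37)."""
    return xb2 + sigma12 * inv_mills(xb1)

def censored_mean(xb1, xb2, sigma12):
    """E[y2 | x] from (16.39), for the case y2 = 0 when y1* <= 0."""
    return norm_cdf(xb1) * xb2 + sigma12 * norm_pdf(xb1)

# Hypothetical index values and error covariance.
xb1, xb2, sigma12 = 0.5, 2.0, 0.8
print("latent   :", latent_mean(xb2))
print("truncated:", truncated_mean(xb1, xb2, sigma12))
print("censored :", censored_mean(xb1, xb2, sigma12))
```

By construction the censored mean equals the participation probability $\Phi(x_1'\beta_1)$ times the truncated mean.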


16.5.4. Heckman Two-Step Estimator 


An important result is that OLS regression of $y_2$ on $x_2$ alone using just the observed
positive values of $y_2$ leads to inconsistent estimation of $\beta_2$ unless the errors are
uncorrelated so that $\sigma_{12} = 0$. This is clear from the truncated mean formula (16.37), which
additionally includes the “regressor” $\lambda(x_1'\beta_1)$.

Heckman’s two-step procedure, sometimes called the Heckit estimator, augments
the OLS regression by an estimate of the omitted regressor $\lambda(x_1'\beta_1)$. Thus using
positive values of $y_2$ estimate by OLS the model

$$
y_{2i} = x_{2i}'\beta_2 + \sigma_{12}\lambda(x_{1i}'\hat{\beta}_1) + v_i, \tag{16.40}
$$

where $v$ is an error term, $\hat{\beta}_1$ is obtained by first-step probit regression of $y_1$ on $x_1$ since
$\Pr[y_1^* > 0] = \Phi(x_1'\beta_1)$, and $\lambda(x_1'\hat{\beta}_1) = \phi(x_1'\hat{\beta}_1)/\Phi(x_1'\hat{\beta}_1)$ is the estimated inverse
Mills ratio. This regression does not directly provide an estimate of $\sigma_2^2$, but the
truncated variance formula (16.38) leads to the estimate
$\hat{\sigma}_2^2 = N^{-1}\sum_i [\hat{v}_i^2 + \hat{\sigma}_{12}^2 \hat{\lambda}_i (x_{1i}'\hat{\beta}_1 + \hat{\lambda}_i)]$,
where $\hat{v}_i$ is the OLS residual from (16.40) and $\hat{\lambda}_i = \lambda(x_{1i}'\hat{\beta}_1)$. The correlation
between the two errors in (16.32) can then be estimated by $\hat{\rho} = \hat{\sigma}_{12}/\hat{\sigma}_2$.
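
The two steps just described can be sketched in a few lines. The block below is a minimal illustration on simulated data, not a production estimator: it assumes NumPy is available, fits the first-step probit by Fisher scoring, and does not compute the corrected standard errors. All function names, the data-generating values, and the excluded regressor `z` are hypothetical.

```python
import math
import numpy as np

def norm_pdf(z):
    return np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / np.sqrt(2.0)))

def probit_fit(X, d, max_iter=50):
    """Probit MLE via Fisher scoring, started from zero coefficients."""
    b = np.zeros(X.shape[1])
    for _ in range(max_iter):
        xb = X @ b
        p = np.clip(norm_cdf(xb), 1e-10, 1.0 - 1e-10)
        f = norm_pdf(xb)
        w = f / (p * (1.0 - p))
        score = X.T @ (w * (d - p))
        info = (X * (w * f)[:, None]).T @ X
        step = np.linalg.solve(info, score)
        b = b + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return b

def heckman_two_step(y2, X2, d, X1):
    """Return (beta1_hat, beta2_hat, sigma12_hat, rho_hat)."""
    b1 = probit_fit(X1, d)                       # step 1: probit of y1 on x1
    sel = d == 1
    xb1 = X1[sel] @ b1
    lam = norm_pdf(xb1) / norm_cdf(xb1)          # estimated inverse Mills ratio
    W = np.column_stack([X2[sel], lam])          # step 2: OLS of (16.40)
    coef, *_ = np.linalg.lstsq(W, y2[sel], rcond=None)
    b2, s12 = coef[:-1], coef[-1]
    resid = y2[sel] - W @ coef
    s2_sq = np.mean(resid ** 2 + s12 ** 2 * lam * (xb1 + lam))
    return b1, b2, s12, s12 / np.sqrt(s2_sq)

# Simulated data under (16.31)-(16.32); all values are hypothetical.
rng = np.random.default_rng(0)
n = 20_000
x, z = rng.standard_normal(n), rng.standard_normal(n)
X1 = np.column_stack([np.ones(n), x, z])         # z is excluded from X2
X2 = np.column_stack([np.ones(n), x])
beta1, beta2 = np.array([0.2, 0.5, 1.0]), np.array([1.0, 2.0])
sigma12, sigma2 = 0.6, 1.2
e1 = rng.standard_normal(n)
e2 = sigma12 * e1 + np.sqrt(sigma2 ** 2 - sigma12 ** 2) * rng.standard_normal(n)
d = (X1 @ beta1 + e1 > 0).astype(float)
y2 = X2 @ beta2 + e2                             # used only where d == 1

b1_hat, b2_hat, s12_hat, rho_hat = heckman_two_step(y2, X2, d, X1)
b2_ols, *_ = np.linalg.lstsq(X2[d == 1], y2[d == 1], rcond=None)
print("two-step beta2:", b2_hat, "sigma12:", s12_hat, "rho:", rho_hat)
print("naive OLS beta2 on participants:", b2_ols)
```

In this design the naive OLS intercept is shifted upward by the omitted selection term, whereas the two-step estimates should lie close to the values used to generate the data.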

A test of whether or not $\sigma_{12} = 0$ or $\rho = 0$ is a test of whether or not the errors are
correlated and sample selection correction is needed. One such test is a Wald test based
on $\hat{\sigma}_{12}$, the estimated coefficient of the inverse Mills ratio.

It is important to note that both the usual OLS standard errors and
heteroskedasticity-robust standard errors reported from the regression (16.40) are
incorrect. Correct formulas for the standard errors take account of two complications
in the second-stage regression. First, even if $\beta_1$ were known, the error in (16.40) is
heteroskedastic from (16.38). Second, in fact $\beta_1$ is replaced by an estimate, a
complication studied in Section 6.6. Formulas for the correct standard errors are given in
Heckman (1979); see also Greene (1981). Section 16.10.2 derives these formulas for the
simpler Tobit model. Implementation is not simple so it is best to use a package that
automatically handles this complication or to use the bootstrap.

The resulting estimator of $\beta_2$ is consistent. Despite an efficiency loss, which can be
quite large, compared to the MLE under joint normality of the errors, the estimator is
very popular for the following reasons: (1) It is simple to implement; (2) the approach
is applicable to a range of selection models including those given in Section 16.7;
(3) the estimator requires distributional assumptions weaker than joint normality of $\varepsilon_1$
and $\varepsilon_2$; and (4) these distributional assumptions can be weakened even further to permit
semiparametric estimation as in Section 16.9.
The key assumption needed is (16.35), essentially that

$$
\varepsilon_2 = \delta\varepsilon_1 + \xi, \tag{16.41}
$$

where $\xi$ is independent of $\varepsilon_1$. This seems to be a quite sensible model. In the case of
expenditures on a durable good, say, this says that the error in the expenditure equation
is a multiple of the error in the purchase decision equation, plus some noise that is
independent of the purchase decision; essentially a linear regression model for the
errors. Given assumption (16.41) the conditional mean (16.34) becomes

$$
E[y_2 \mid y_1^* > 0] = x_2'\beta_2 + \delta E[\varepsilon_1 \mid \varepsilon_1 > -x_1'\beta_1]. \tag{16.42}
$$

If $\varepsilon_1$ is standard normal distributed this leads to (16.37), the basis for the OLS
regression (16.40).

More generally, Heckman’s two-step method can be applied to (16.42) with
distributions for $\varepsilon_1$ other than normal; see, for example, Olsen (1980). One can also use
semiparametric methods that do not impose a functional form for $E[\varepsilon_1 \mid \varepsilon_1 > -x_1'\beta_1]$
(see Section 16.9).


16.5.5. Identification Considerations 


The bivariate sample selection model with normal errors is theoretically identified
without any restriction on the regressors. In particular, exactly the same regressors
can appear in the equations for $y_1^*$ and $y_2^*$.

The model with normally distributed errors is close to unidentified, however, if
exactly the same regressors are used. If $x_1 = x_2$ then $E[y_2 \mid y_1^* > 0] \simeq x_2'\beta_2 + a + b\,x_2'\beta_1$,
using (16.37) and the observation from Section 16.3.2 that the inverse Mills ratio term
$\lambda(\cdot)$ is approximately linear over a wide range of its argument. This leads to obvious
multicollinearity problems, discussed in many articles including those by Nawata
(1993), Nawata and Nagase (1996), and Leung and Yu (1996). Multicollinearity can
be detected using the condition number given in Section 10.4.2, where from (16.40)
the regressors are $x_2$ and $\lambda(x_1'\hat{\beta}_1)$. The problem is less severe the greater the variation
in $x_1'\hat{\beta}_1$ across observations, that is, the better a probit model can discriminate between
participants and nonparticipants.
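
The near-linearity of the inverse Mills ratio, and the role played by variation in the probit index, can be illustrated numerically. The sketch below regresses $\lambda(z)$ on an intercept and $z$ over an evenly spaced grid and reports the $R^2$; the two grids are hypothetical ranges for $x_1'\beta_1$, and the function name `r_squared_of_linear_fit` is an assumption.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def inv_mills(z):
    return norm_pdf(z) / norm_cdf(z)

def r_squared_of_linear_fit(z_values):
    """R^2 from regressing lambda(z) on an intercept and z."""
    lam = [inv_mills(z) for z in z_values]
    n = len(z_values)
    z_bar = sum(z_values) / n
    l_bar = sum(lam) / n
    s_zz = sum((z - z_bar) ** 2 for z in z_values)
    s_ll = sum((l - l_bar) ** 2 for l in lam)
    s_zl = sum((z - z_bar) * (l - l_bar) for z, l in zip(z_values, lam))
    return s_zl ** 2 / (s_zz * s_ll)

# Hypothetical ranges for the probit index x1'beta1.
narrow = [-0.5 + 0.01 * k for k in range(101)]   # index in [-0.5, 0.5]
wide = [-3.0 + 0.01 * k for k in range(601)]     # index in [-3, 3]
r2_narrow = r_squared_of_linear_fit(narrow)
r2_wide = r_squared_of_linear_fit(wide)
print(f"R^2 of linear fit, narrow index range: {r2_narrow:.4f}")
print(f"R^2 of linear fit, wide index range:   {r2_wide:.4f}")
```

When the index varies little, $\lambda(\cdot)$ is almost exactly linear in it, so with $x_1 = x_2$ the correction term is close to a linear combination of the other regressors. A wider index range exposes more of the curvature that identifies $\sigma_{12}$.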

Semiparametric variants of the Heckman two-step method (see Section 16.9.3) do 
require an exclusion restriction. So identification in the bivariate sample selection 
model with normal errors is being achieved by functional form assumptions. 

For practical purposes therefore, estimation of the bivariate sample selection model
may require that at least one regressor in the participation equation ($y_1^*$) be excluded
from the outcome equation ($y_2^*$). For example, fixed costs of working unrelated to
hours worked will affect the decision to work but not hours worked. This can be a
great limitation as in many applications, such as that in Section 16.6, it can be very
difficult to make defensible exclusion restrictions.


16.5.6. Marginal Effects


The marginal effects in the bivariate sample selection model vary according to whether 
we consider the latent variable mean or the truncated mean given in (16.37) or the 
censored mean (if it is appropriate). 

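It is convenient to define $x$ to be the vector formed by union of $x_1$ and $x_2$, and
rewrite $x_1'\beta_1$ as $x'\gamma_1$ and $x_2'\beta_2$ as $x'\gamma_2$. For example, the truncated mean becomes
$E[y_2 \mid x, y_1^* > 0] = x'\gamma_2 + \sigma_{12}\lambda(x'\gamma_1)$. Note that $\gamma_1$ and/or $\gamma_2$ will have some zero entries if
$x_1 \neq x_2$. Differentiating with respect to $x$ yields the marginal effects

$$
\begin{aligned}
\text{uncensored:} \quad & \partial E[y_2^* \mid x]/\partial x = \gamma_2, \\
\text{truncated (at 0):} \quad & \partial E[y_2 \mid x, y_1 = 1]/\partial x = \gamma_2 - \sigma_{12}\lambda(x'\gamma_1)\left(x'\gamma_1 + \lambda(x'\gamma_1)\right)\gamma_1, \\
\text{censored (at 0):} \quad & \partial E[y_2 \mid x]/\partial x = \gamma_1\phi(x'\gamma_1)\,x'\gamma_2 + \Phi(x'\gamma_1)\gamma_2 - \sigma_{12}\,x'\gamma_1\,\phi(x'\gamma_1)\gamma_1,
\end{aligned} \tag{16.43}
$$

where $\lambda(z) = \phi(z)/\Phi(z)$, and we use $\partial\phi(z)/\partial z = -z\phi(z)$ and
$\partial\lambda(z)/\partial z = -z\phi(z)/\Phi(z) - \phi(z)^2/\Phi(z)^2 = -\lambda(z)(z + \lambda(z))$. Interpretation of these three
derivatives is similar to that discussed in some detail in Section 16.3.5. As already noted,
analysis of the censored mean is appropriate only if $y_2$ takes the value of zero when
$y_1 = 0$. In applications such as the log-normal health expenditures example discussed
later there is no censored mean.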
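
The analytical derivatives in (16.43) can be verified against numerical derivatives of the truncated mean (16.37) and the censored mean (16.39). The sketch below does this at one hypothetical point; the function names and all parameter values are assumptions for illustration, and the zero entry in `g2` mimics a regressor excluded from the outcome equation.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def inv_mills(z):
    return norm_pdf(z) / norm_cdf(z)

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def truncated_mean(x, g1, g2, s12):
    return dot(x, g2) + s12 * inv_mills(dot(x, g1))

def censored_mean(x, g1, g2, s12):
    a = dot(x, g1)
    return norm_cdf(a) * dot(x, g2) + s12 * norm_pdf(a)

def me_truncated(x, g1, g2, s12):
    """Second line of (16.43)."""
    a = dot(x, g1)
    lam = inv_mills(a)
    return [c2 - s12 * lam * (a + lam) * c1 for c1, c2 in zip(g1, g2)]

def me_censored(x, g1, g2, s12):
    """Third line of (16.43)."""
    a, b = dot(x, g1), dot(x, g2)
    return [c1 * norm_pdf(a) * b + norm_cdf(a) * c2 - s12 * a * norm_pdf(a) * c1
            for c1, c2 in zip(g1, g2)]

def numeric_gradient(f, x, h=1e-6):
    """Central finite differences, one regressor at a time."""
    grad = []
    for k in range(len(x)):
        up, dn = list(x), list(x)
        up[k] += h
        dn[k] -= h
        grad.append((f(up) - f(dn)) / (2.0 * h))
    return grad

# Hypothetical evaluation point and parameters.
x = [0.4, -1.2]
g1, g2, s12 = [0.8, 0.5], [1.5, 0.0], 0.6
print("truncated, analytic:", me_truncated(x, g1, g2, s12))
print("truncated, numeric :", numeric_gradient(lambda v: truncated_mean(v, g1, g2, s12), x))
print("censored,  analytic:", me_censored(x, g1, g2, s12))
print("censored,  numeric :", numeric_gradient(lambda v: censored_mean(v, g1, g2, s12), x))
```

Note that the second regressor has a nonzero marginal effect on the truncated and censored means even though its coefficient in the outcome equation is zero, because it still shifts the participation index.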


16.5.7. Selection on Observables and on Unobservables 


There are many modeling situations that can be considered a two-part decision
problem of first engaging in an activity and then determining the level of the activity. These
decisions are intertwined and can be expected to depend on common factors. The
natural model for such data is the bivariate selection model (16.29)-(16.31).

After inclusion of regressors any remaining error ($\varepsilon_1$ and $\varepsilon_2$) in the two processes
may in some cases be uncorrelated. For example, for models of hospitalization it is
possible that, after controlling for observed individual characteristics such as health
status, there is no correlation between the error in the equation determining hospital
admission and the error in the equation determining length of hospital stay. In that
case analysis is straightforward as selection is only based on observables since, for
example, (16.37) simplifies when $\sigma_{12} = 0$. The two pieces can be modeled separately
and the simpler two-part model of Section 16.4 can be used.

In other cases the errors may be correlated even after inclusion of the regressors. 
For example, in labor supply unobserved factors that make someone more likely to 
work may also make them more likely to work longer hours than would be predicted 
by the observable regressors. One can test whether there is such correlation between 
the errors. If there is correlation, then selection is on unobservables and the methods of 
this chapter come into play. Relatively strong distributional assumptions are needed, 
even with the Heckman two-step method. 

The study by Duan et al. (1983) summarized in Section 16.4.2 was criticized for 
using the two-part model, which is more restrictive than the sample selection model. 
This led to considerable debate, with many of the relevant articles referenced in Leung
and Yu (1996), who emphasize the important role of potential correlation of the inverse
Mills ratio term with the remaining regressors. 

More generally, selection models such as the bivariate selection model permit
selection on both observables and unobservables, that is, on both observed regressors
and unobserved errors. Such a model is often more simply referred to as a model
of selection on unobservables, with selection on observables implicit. This chapter
emphasizes selection on unobservables.

If instead we have only selection on observables, analysis becomes much simpler. 
The two-part model of this chapter is an example. Chapter 25 on treatment evaluation 
emphasizes selection on observables (see the discussion in Section 25.3.3) and details 
methods such as propensity score matching. 


16.6. Selection Example: Health Expenditures 


For illustration we use data from the RAND Health Insurance Experiment (RHIE). 
The data extract comes from Deb and Trivedi (2002), who modeled the number of 
outpatient visits to a medical doctor and to all providers using count data models. 
Section 20.3 summarizes the data and Section 20.7 presents estimates of some standard 
count models. 

Here instead we model annual health expenditures. The regressors are the same 
regressors as defined in detail in Table 20.4. They can be broken down into health in- 
surance variables (LC, IDP, LPI, and FMDE), socioeconomic characteristics (LINC, 
LFAM, AGE, FEMALE, CHILD, FEMCHILD, BLACK, and EDUCDEC) and health 
status variables (PHYSLIM, NDISEASE, HLTHG, HLTHF, and HLTHP). The analy- 
sis in Chapter 20 uses four years of data whereas here we use only the second year of 
data, yielding 5,574 observations with summary statistics similar to but not exactly the 
same as those given in Table 20.4. 

The dependent variable y is annual individual health expenditures. An econometric 
model needs to take account of two complications: (1) Health expenditures are zero 
for 23.2% of the sample and (2) the positive health expenditures are very right-skewed 
with a mean of $221 that is much larger than the median of $53. The logarithmic
transformation essentially eliminates this skewness: the mean of 4.07 is close to the
median of 3.96 and the skewness statistic falls from 24.0 to 0.3. The kurtosis is 3.29,
close to the normal value of 3.

We focus on modeling $\ln y$ for those with positive medical expenditures. Possible
models include a two-part model, exposited for log medical expenditures in Section
16.4.2, and a bivariate sample selection model (see Section 16.5.2), where $y_1$ in (16.29)
is an indicator for positive expenditures and $y_2$ in (16.30) is $\ln y$. Note that it is not
meaningful to consider the value of $y_2$ when $y_1 = 0$ because $\ln 0$ is not defined. The
two-part model is a special case of the bivariate sample selection model with $\sigma_{12} = 0$
in (16.32).

Table 16.1 presents results for the health insurance variables and health status
regressors. Socioeconomic variables also included in the regression are omitted from the
table for brevity.


Table 16.1. Health Expenditure Data: Estimates from Two-Part and Selection Models

Model                    Two-Part         Selection Two-Step       Selection MLE
Equation              DMED     LNMED        DMED     LNMED        DMED     LNMED
LC                  -0.119    -0.016      -0.119    -0.028      -0.107    -0.076
                   (-4.41)   (-0.52)     (-4.41)   (-0.70)     (-4.03)   (-2.25)
IDP                 -0.128    -0.079      -0.128    -0.028      -0.109    -0.150
                   (-2.45)   (-1.28)     (-2.45)   (-0.70)     (-2.13)   (-2.26)
LPI                  0.028     0.003       0.028     0.005       0.029     0.015
                    (3.19)    (0.28)      (3.19)    (0.47)      (3.42)    (1.42)
FMDE                 0.008    -0.031       0.008    -0.030       0.001    -0.024
                    (0.47)   (-1.69)      (0.47)   (-1.62)      (0.05)   (-1.21)
PHYSLIM              0.273     0.262       0.273     0.281       0.285     0.355
                    (3.67)    (3.81)      (3.67)    (3.50)      (3.94)    (4.70)
NDISEASE             0.022     0.020       0.022     0.022       0.021     0.029
                    (6.25)    (5.78)      (6.25)    (4.29)      (6.03)    (7.54)
HLTHG                0.039     0.144       0.039     0.147       0.058     0.156
                    (0.88)    (2.97)      (0.88)    (3.01)      (1.35)    (2.99)
HLTHF                0.192     0.364       0.192     0.382       0.224     0.445
                    (2.29)    (4.13)      (2.29)    (3.98)      (2.75)    (4.66)
HLTHP                0.640     0.787       0.640     0.833       0.798     0.999
                    (3.01)    (4.63)      (3.01)    (4.22)      (3.90)    (5.32)
rho                            0.000                 0.168                 0.736
sigma_2                                              1.401                 1.570
sigma_12 = rho*sigma_2         0.000                 0.236                 1.155
                                                    (0.47)               (16.43)
-ln L                        10184.1                                     10170.1

Note: The t-statistics are in parentheses. Regressors also include eight socioeconomic
characteristics. DMED is an indicator for whether or not medical expenditures are positive
and LNMED is the natural logarithm of expenditures if positive. The t-statistics for the
second step of the two-step selection model are based on errors that correct for the
first-step estimation used to obtain the fitted inverse Mills ratio term.


We first compare the two-part model estimates with the two-step estimates of the
bivariate sample selection model. The DMED equation estimates are identical as they
are obtained by probit regression of DMED on the same regressors. The LNMED
equation estimates differ because for two-step sample selection the second-step OLS
regression for LNMED additionally includes as a regressor the fitted value of the
inverse Mills ratio term. This additional term is statistically insignificant (t = 0.47) and
low in magnitude with implied $\hat{\rho} = 0.168$ that is close to zero. As a result the two
models lead to similar coefficient estimates in the LNMED equation.
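
The implied correlation follows from $\hat{\rho} = \hat{\sigma}_{12}/\hat{\sigma}_2$ applied to the entries of Table 16.1. As a worked check, using only the reported estimates for the two-step and ML columns:

```latex
\hat{\rho}_{\text{two-step}} = \frac{\hat{\sigma}_{12}}{\hat{\sigma}_2} = \frac{0.236}{1.401} \approx 0.168,
\qquad
\hat{\rho}_{\text{MLE}} = \frac{\hat{\sigma}_{12}}{\hat{\sigma}_2} = \frac{1.155}{1.570} \approx 0.736 .
```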

As noted in Section 16.4.4 the two-step estimator can perform poorly if the inverse
Mills ratio term is highly correlated with the other regressors. Here this does not appear
to be the case as there is considerable range in the probit model predicted probabilities,
from 0.15 to 0.99, and the condition number (see Section 10.4.4) of the second-stage
regressors, although somewhat high, only doubles from 37
to 82 upon inclusion of the inverse Mills ratio. Although it is still preferable to have
some exclusion restrictions, it is not clear in this application which regressors in the
DMED equation might be reasonably excluded on a priori grounds from the LNMED
equation.

The ML estimates of the bivariate sample selection model differ considerably from 
the previous estimates, in both DMED and LNMED equations. The errors in the
latent variable models for DMED and LNMED are highly correlated with estimate
$\hat{\rho} = 0.736$ that is highly statistically significant (t = 16.43). The big difference
between the two-step estimates and the ML estimates of $\sigma_{12}$ (or of $\rho$) is best viewed
as signifying a problem with the bivariate sample selection model. Rejection of the
null hypothesis that the estimates have the same probability limit, a Hausman test
given in Section 8.4, can be interpreted as rejection of the additional joint normality
assumption needed to go from two-step estimation to ML estimation of the bivariate
selection model. However, there may be a more fundamental problem that the bivariate
sample selection model with the weaker assumption (16.41) and $\varepsilon_1$ iid normal is also
not reasonable. Such fragility of the bivariate sample selection model is not unusual,
especially if the same regressors are being used in both parts of the model so that
identification is being secured through model specification assumptions. It is compounded
here by use of health expenditure data, which can have quite large outliers so that
errors may not be normal. Even though LNMED has skewness close to 0 and kurtosis
close to 3, as already noted, standard tests of heteroskedasticity, skewness, and kurtosis
resoundingly reject (with p-value 0.000) the null hypothesis that LNMED is normally
distributed.

The regressor of most interest is LC, the natural logarithm of the coinsurance rate,
where the coinsurance rate equals the percentage of health costs paid by the insured
patient. The most statistically significant effect of LC is in determining whether
or not expenditures are positive, rather than on the size of positive expenditures. If all
observations were positive then the coefficient of LC in the regression for LNMED would
equal the price elasticity of demand for health care. In fact, in predicting the effect of changes
in price on the conditional truncated mean of log expenditure we need to control for
the effect of those with zero expenditure, as in the second line of (16.43).

In some applications interest lies in prediction rather than estimation of marginal 
effects. This is complicated in this example by a desire to predict the level rather than 
the log of expenditure. Assuming log-normality, the expression for the two-part model 
is given in (16.28). Duan et al. (1983) present a method to make predictions without 
the log-normality assumption that can be viewed as a variant of a bootstrap. See also 
Mullahy (1998). 
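
A hedged sketch of retransformation from logs to levels follows. It uses the "smearing" factor, the sample average of the exponentiated log-scale residuals, which is the approach commonly attributed to Duan (1983); that this is exactly the method the text refers to is an assumption here. The residuals are simulated from a normal distribution with a made-up standard deviation, so that the smearing factor can be compared with the log-normal factor $\exp(\sigma^2/2)$. Function names are hypothetical.

```python
import math
import random

def smearing_factor(residuals):
    """Average of exp(residual): no distributional assumption on the errors."""
    return sum(math.exp(e) for e in residuals) / len(residuals)

def lognormal_factor(residuals):
    """exp(s^2 / 2), valid if the log-scale errors are normal."""
    n = len(residuals)
    m = sum(residuals) / n
    s2 = sum((e - m) ** 2 for e in residuals) / n
    return math.exp(0.5 * s2)

def predict_level(xb, factor):
    """Predicted level of y given the log-scale fitted value xb."""
    return math.exp(xb) * factor

# Hypothetical log-scale residuals; in practice these come from the LNMED regression.
rng = random.Random(1)
resid = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
f_smear = smearing_factor(resid)
f_lognormal = lognormal_factor(resid)
print(f"smearing factor  : {f_smear:.4f}")
print(f"log-normal factor: {f_lognormal:.4f}")
print(f"predicted level at fitted log value 4.0: {predict_level(4.0, f_smear):.2f}")
```

With normal residuals the two factors nearly coincide. With skewed or heavy-tailed residuals they can differ, which is the case in which a distribution-free retransformation is attractive. For a two-part model the level prediction would additionally be multiplied by the predicted probability of positive expenditure.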


16.7. Roy Model 


In the bivariate sample selection model the dependent variable for an individual might
not be observed. Thus we observe $y_2$ for an individual if $y_1 = 1$ but may not observe
$y_2$ at all if $y_1 = 0$. In this section we consider a model in which $y_2$ is observed for all
individuals, but in only one of the two possible states. This important model
emphasizes counterfactuals and connects with the program evaluation literature presented
in Chapter 25.


16.7.1. Roy Model 


An often-cited article by Roy (1951) considered the consequences for the
occupational distribution of earnings (both mean and variance) when there is individual
heterogeneity in skills and individuals self-select into occupations. The treatment was
relatively general and nonmathematical, though it did assume that individual worker 
output in an occupation is log-normally distributed in the absence of selection, and it 
did not consider at all estimation of a formal model. During the 1970s a number of 
authors independently proposed models for similar situations that were estimable with 
cross-section data and considered selection on both observables and unobservables. 
Such models have become known as Roy models. 

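We define the prototypical Roy model as follows. A latent variable $y_1^*$ determines
whether the outcome observed is $y_2^*$ or $y_3^*$. Specifically, we observe whether $y_1^*$ is
positive or negative,

$$
y_1 = \begin{cases} 1 & \text{if } y_1^* > 0, \\ 0 & \text{if } y_1^* \le 0, \end{cases} \tag{16.44}
$$

and observe exactly one of $y_2^*$ and $y_3^*$ according to

$$
y = \begin{cases} y_2^* & \text{if } y_1^* > 0, \\ y_3^* & \text{if } y_1^* \le 0. \end{cases} \tag{16.45}
$$

It is customary to specify a linear model with additive errors for the latent variables,
with

$$
\begin{aligned}
y_1^* &= x_1'\beta_1 + \varepsilon_1, \\
y_2^* &= x_2'\beta_2 + \varepsilon_2, \\
y_3^* &= x_3'\beta_3 + \varepsilon_3.
\end{aligned} \tag{16.46}
$$

A model with additive effect is the specialization $x_3'\beta_3 = x_2'\beta_2 + \alpha$. The simplest
parametric model for correlated errors is the joint normal, with

$$
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \end{bmatrix} \sim N\left[ \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & \sigma_{12} & \sigma_{13} \\ \sigma_{12} & \sigma_2^2 & \sigma_{23} \\ \sigma_{13} & \sigma_{23} & \sigma_3^2 \end{bmatrix} \right], \tag{16.47}
$$

where as usual the normalization $\sigma_1^2 = 1$ is used as only the sign of $y_1^*$ is observed.
The log-likelihood function is similar to that for the bivariate sample selection
model of Section 16.5, except that now $y_3^*$ is observed if $y_1^* \le 0$, so the term
$\Pr[y_{1i}^* \le 0]$ in (16.33) is replaced by $f(y_{3i} \mid y_{1i}^* \le 0) \times \Pr[y_{1i}^* \le 0]$.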
It is more common to estimate the model using Heckman’s two-step method applied
to the truncated means,

$$
\begin{aligned}
E[y \mid x, y_1^* > 0] &= x_2'\beta_2 + \sigma_{12}\lambda(x_1'\beta_1), \\
E[y \mid x, y_1^* \le 0] &= x_3'\beta_3 - \sigma_{13}\lambda(-x_1'\beta_1),
\end{aligned} \tag{16.48}
$$

where $\lambda(z) = \phi(z)/\Phi(z)$ and we have used $\sigma_1^2 = 1$. First-stage probit estimation of
whether or not $y_1^* > 0$ yields an estimate of $\beta_1$ and hence $\lambda(x_1'\hat{\beta}_1)$. Two separate OLS
regressions then lead to direct estimates of $(\beta_2, \sigma_{12})$ and $(\beta_3, \sigma_{13})$. Estimates of $\sigma_2^2$
and $\sigma_3^2$ can then be obtained using the squared residuals from the regressions, similar
to the technique used for the bivariate sample selection model after (16.40). Maddala
(1983, p. 225) provides complete details for this model, which he calls a switching
regression model with endogenous switching. This is also the Tobit type 5 model
presented in Amemiya (1985, p. 399).
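
The two selection terms in (16.48) can be checked by simulation, including the sign change and the argument $-x_1'\beta_1$ in the second regime. In the sketch below $c$ stands for $x_1'\beta_1$, and $\varepsilon_2$ and $\varepsilon_3$ are each generated as a multiple of $\varepsilon_1$ plus independent noise. This fixes $\sigma_{23} = \sigma_{12}\sigma_{13}$, a simplifying assumption that does not affect the truncated means. All values and names are hypothetical.

```python
import math
import random

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def inv_mills(z):
    return norm_pdf(z) / norm_cdf(z)

def simulate_roy_error_means(c, sigma12, sigma13, sigma2=1.0, sigma3=1.0,
                             n=200_000, seed=0):
    """Sample means of eps2 given y1* > 0 and of eps3 given y1* <= 0."""
    rng = random.Random(seed)
    sd2 = math.sqrt(sigma2 ** 2 - sigma12 ** 2)
    sd3 = math.sqrt(sigma3 ** 2 - sigma13 ** 2)
    upper, lower = [], []
    for _ in range(n):
        e1 = rng.gauss(0.0, 1.0)
        if e1 > -c:       # y1* > 0, so y2* is the observed outcome
            upper.append(sigma12 * e1 + rng.gauss(0.0, sd2))
        else:             # y1* <= 0, so y3* is the observed outcome
            lower.append(sigma13 * e1 + rng.gauss(0.0, sd3))
    return sum(upper) / len(upper), sum(lower) / len(lower)

c, sigma12, sigma13 = 0.3, 0.5, -0.4       # hypothetical values
mean2_sim, mean3_sim = simulate_roy_error_means(c, sigma12, sigma13)
mean2_formula = sigma12 * inv_mills(c)     # first line of (16.48)
mean3_formula = -sigma13 * inv_mills(-c)   # second line of (16.48)
print(f"regime y1* > 0 : simulated {mean2_sim:.4f}, formula {mean2_formula:.4f}")
print(f"regime y1* <= 0: simulated {mean3_sim:.4f}, formula {mean3_formula:.4f}")
```

In a two-step implementation these terms correspond to adding $\lambda(x_1'\hat{\beta}_1)$ as a regressor for observations with $y_1 = 1$ and $-\lambda(-x_1'\hat{\beta}_1)$ for observations with $y_1 = 0$.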


16.7.2. Variations of the Roy Model 


Many models fall into the class of Roy models. Maddala (1983, Chapter 9) gives nu- 
merous references to what he calls models with self-selectivity. See also Amemiya 
(1985, Chapter 10). Here we present a few leading examples. 

The bivariate sample selection model can be viewed as a special case where $y_3^*$
is ignored and we only model the truncated moment $E[y_2^* \mid y_1^* > 0]$. Bivariate sample
selection models where $y_2 = 0$ when $y_1^* \le 0$, such as in labor supply applications, can
more directly be viewed as Roy models where we observe either $y = y_2^*$ or $y = 0$, so
$y_3^* = 0$.

In the study of L.-F. Lee (1978), $y_2^*$ and $y_3^*$ denote, respectively, union and nonunion
wage and $y_1^*$ denotes tendency to be a union member. This adds the additional structure
that

$$
y_1^* = y_2^* - y_3^* + z'\gamma + \zeta,
$$

where $z'\gamma + \zeta$ reflects costs of union membership and is very much in the spirit of Roy
(1951). Substituting for $y_2^*$ and $y_3^*$ yields a reduced form for $y_1^*$:

$$
y_1^* = (x_2'\beta_2 - x_3'\beta_3 + z'\gamma) + (\varepsilon_2 - \varepsilon_3 + \zeta).
$$

This model is now the same as the earlier model, with correction term $\lambda(x_1'\hat{\beta}_1)$ obtained
by first-step probit regression of $y_1$ on $x_1$, where $x_1$ denotes the unique regressors in
$x_2$, $x_3$, and $z$.

If only the intercept varies across the two possible outcomes, by an amount $\alpha$ say,
then the Roy model reduces to two latent variables

$$
\begin{aligned}
y_1^* &= x_1'\beta_1 + \varepsilon_1, \\
y^* &= x'\beta + \alpha y_1 + \varepsilon,
\end{aligned}
$$

where $y = y^*$ is always observed and we also observe the binary variable $y_1$ equal to
one if $y_1^* > 0$ and equal to zero otherwise. This model for $y$ can be viewed as one with
dummy endogenous variable ($y_1$). It can be estimated using the Heckman two-step
estimator applied to the expression for $E[y^* \mid x]$. Alternatively, instrumental variables
estimation can be used, provided an instrument for $y_1$ is available. This requires a
regressor that does not determine the level of the outcome of interest but does determine
which outcome is chosen.

These Roy models are similar to the models studied in the treatment effects
literature. There are two potential outcomes, here $y_2^*$ and $y_3^*$, but we can only observe one
of them. The approach in this chapter has been to create the counterfactual by
making strong distributional assumptions on the distribution of unobservables. Chapter 25
presents alternative methods. See especially Section 25.3 for connections between the
different approaches.


16.8. Structural Models


Regression models for selected samples have the feature that the outcome of inter- 
est depends in part on a participation decision that will in turn depend on expected 
outcomes. The participation decision and outcomes are simultaneous decisions. The 
preceding presentations simplified this interdependence by giving a reduced-form 
version of the participation equation. In particular, see the exposition of Lee (1978) 
in Section 16.7.2. This is a valid approach though is less efficient than working with a 
fully structural version. 

In this section we explicitly model the interdependence using structural economic 
models based on utility maximization, and using structural statistical models that ex- 
tend linear simultaneous equations to cover censoring and truncation, including binary 
outcomes. 


16.8.1. Structural Models Based on Utility Maximization 


Initial structural model research considered female labor supply. The textbook 
model has consumers maximizing utility, a function of goods consumption and leisure 
time, subject to a budget constraint and a time constraint that available discretionary 
time be allocated between leisure time and working time. At an interior solution the 
marginal rate of substitution (MRS) between leisure and goods consumption equals the 
wage rate. However, a corner solution where the woman chooses not to work can arise 
if the MRS exceeds the offered wage. Gronau (1973) and Heckman (1974) presented 
econometric models consistent with utility maximization that led to Tobit-like models, 
accounting for the additional complication that the offered wage is not observed for 
women who do not work. Subsequent advances include incorporation of fixed costs 
of work, leading to sample selection models, and use of panel data, leading to panel 
Tobit models. Killingsworth and Heckman (1986) and Blundell and MaCurdy (2001) 
provide surveys and Mroz (1987) provides an application. 

To illustrate the structural approach we summarize the following example. Dubin 
and McFadden (1984) modelled household consumption of energy (electricity or nat- 
ural gas) and choice of appliances (such as electric heater or natural gas heater) as 
being interrelated decisions coming from the same utility function. Specifically, it 
is assumed that for the jth of m appliance portfolios household indirect utility is 
given by 


$$V_j = \{\alpha_{0j} + \alpha_1/\beta + \alpha_1 p_1 + \alpha_2 p_2 + \mathbf{w}'\boldsymbol{\gamma} + \beta(y - r_j) + \eta\}\,e^{-\beta p_1} + \varepsilon_j, \tag{16.49}$$


where $p_1$ and $p_2$ denote the prices of electricity and gas, $y$ denotes income, and $r_j$
denotes the annualized total life-cycle cost of portfolio $j$ with


$$r_j = p_1 q_{1j} + p_2 q_{2j} + \rho c_j,$$


where $q_{1j}$ and $q_{2j}$ denote the typical electricity and gas consumption by a household
with appliance portfolio $j$, $c_j$ is the cost of appliance portfolio $j$, and $\rho$ is the
discount rate. Tastes differ across households owing to observable characteristics $\mathbf{w}$,
unobservable error $\eta$, and an appliance portfolio specific error $\varepsilon_j$, which is
assumed to be independent over $j$ but correlated with $\eta$. In addition, there is a common
appliance specific taste factor $\alpha_{0j}$.

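Electricity demand $x_1$ given appliance portfolio $j$ equals
$-(\partial V_j/\partial p_1)/(\partial V_j/\partial y)$, by Roy's identity, yielding


$$x_1 - q_{1j} = \alpha_{0j} + \alpha_1 p_1 + \alpha_2 p_2 + \mathbf{w}'\boldsymbol{\gamma} + \beta(y - r_j) + \eta.$$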
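
The steps behind this expression can be sketched as follows, writing $A_j$ (a shorthand introduced here, not part of the original presentation) for the term in braces in (16.49) and treating $\varepsilon_j$ as not varying with $p_1$ or $y$.

$$\begin{aligned}
V_j &= A_j e^{-\beta p_1} + \varepsilon_j, \qquad A_j = \alpha_{0j} + \alpha_1/\beta + \alpha_1 p_1 + \alpha_2 p_2 + \mathbf{w}'\boldsymbol{\gamma} + \beta(y - r_j) + \eta, \\
\partial V_j/\partial y &= \beta e^{-\beta p_1}, \\
\partial V_j/\partial p_1 &= (\alpha_1 - \beta q_{1j})\,e^{-\beta p_1} - \beta A_j e^{-\beta p_1} \quad \text{since } \partial r_j/\partial p_1 = q_{1j}, \\
x_1 = -\frac{\partial V_j/\partial p_1}{\partial V_j/\partial y} &= A_j - \alpha_1/\beta + q_{1j}.
\end{aligned}$$

The term $\alpha_1/\beta$ in $A_j$ cancels, which leaves demand net of typical consumption, $x_1 - q_{1j}$, linear in prices, income, and $\eta$.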


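To emphasize that choice of appliance portfolio $j$ is endogenous, introduce $m$ mutually
exclusive indicator variables $\delta_{jk}$, $k = 1, \ldots, m$, where


$$\delta_{jk} = \begin{cases} 1 & \text{if } k = j, \\ 0 & \text{if } k \neq j. \end{cases}$$


Then electricity demand $x_1$ given appliance portfolio $j$ is given by


$$x_1 - q_{1j} = \sum_{k=1}^m \alpha_{0k}\delta_{jk} + \alpha_1 p_1 + \alpha_2 p_2 + \mathbf{w}'\boldsymbol{\gamma} + \beta\Big(y - \sum_{k=1}^m r_k\delta_{jk}\Big) + \eta. \tag{16.50}$$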


Even though the model (16.50) is linear, OLS regression yields inconsistent estimates
as the result of endogeneity of $\delta_{jk}$. Dubin and McFadden (1984) present two alternative
estimation procedures.

An IV approach estimates (16.50) using $\hat{p}_k$ and $r_k\hat{p}_k$ as instruments for $\delta_{jk}$ and
$r_k\delta_{jk}$, $k = 1, \ldots, m$, where $\hat{p}_k$ are the predicted probabilities of choosing the various
appliance portfolios. Here $V_j$ is being used to denote the indirect utility function. It
includes both deterministic and stochastic components of utility and corresponds to
$U_j$ in the Section 15.5.1 presentation of the ARUM. A similar approach yields


$$\begin{aligned}
p_k &= \Pr[V_k > V_l,\ l \neq k,\ l = 1, \ldots, m] \\
&= \Pr[\varepsilon_l - \varepsilon_k < \{(\alpha_{0k} - \alpha_{0l}) - \beta(r_k - r_l)\}e^{-\beta p_1},\ \text{all } l \neq k] \\
&= \frac{\exp[(\alpha_{0k} - \beta r_k)e^{-\beta p_1}\pi/(\lambda\sqrt{3})]}{\sum_{l=1}^m \exp[(\alpha_{0l} - \beta r_l)e^{-\beta p_1}\pi/(\lambda\sqrt{3})]}
\end{aligned}$$


under the assumption that the $\varepsilon_j$, $j = 1, \ldots, m$, are iid type I extreme value with cdf
$F(\varepsilon) = \exp(-\exp(-\gamma - \varepsilon\pi/(\lambda\sqrt{3})))$, where $\gamma \approx 0.5772$ is Euler's constant. Note that
here $\varepsilon_j$ has mean zero and variance $\lambda^2/2$ that differ from those for the parameterization
of the type I extreme value distribution used in Chapters 14 and 15. Estimation of a
nonlinear multinomial logit model gives predicted probabilities $\hat{p}_k$.
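
As a minimal sketch of how these choice probabilities could be evaluated, the following function computes the logit expression for $p_k$ at given parameter values. The function name, argument names, and the numbers in the example call are hypothetical and chosen only for illustration; they are not taken from Dubin and McFadden.

```python
import math

def portfolio_probabilities(alpha0, r, beta, p1, lam):
    """Logit probabilities p_k for m appliance portfolios.

    alpha0[k] : portfolio-specific taste factor alpha_{0k}
    r[k]      : annualized life-cycle cost r_k
    beta, p1  : income coefficient and electricity price
    lam       : scale parameter lambda of the extreme value errors
    """
    scale = math.exp(-beta * p1) * math.pi / (lam * math.sqrt(3.0))
    index = [(a - beta * rk) * scale for a, rk in zip(alpha0, r)]
    top = max(index)  # subtract the maximum before exponentiating, for stability
    weights = [math.exp(v - top) for v in index]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical values, used only to illustrate the calculation.
probs = portfolio_probabilities(alpha0=[0.0, 0.5], r=[1.2, 1.0],
                                beta=0.3, p1=0.1, lam=1.0)
print(probs)
```

In estimation the parameters are unknown, so a routine of this kind would be called inside a likelihood maximization rather than at fixed values.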

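An alternative sample selection approach notes that $E[\eta \mid \text{portfolio } j \text{ chosen}] \neq 0$
and uses assumptions on the distribution of $\eta$ and $\varepsilon_1, \ldots, \varepsilon_m$ to obtain this expectation.
Specifically, assume that $\eta \mid \varepsilon_1, \ldots, \varepsilon_m$ is iid with mean $(\sqrt{2}\sigma/\lambda)\sum_{k=1}^m R_k\varepsilon_k$ and
variance $\sigma^2(1 - \sum_{k=1}^m R_k^2)$, where $\sum_{k=1}^m R_k = 0$ and $\sum_{k=1}^m R_k^2 < 1$ and the distribution
of $\varepsilon_k$ has already been given. Then performing some algebra given in Dubin and
McFadden yields


$$E[\eta \mid \text{portfolio } j \text{ chosen}] = \sum_{k \neq j} (\sigma\sqrt{6}R_k/\pi)\left[\frac{p_k \ln p_k}{1 - p_k} + \ln p_j\right].$$


A Heckman two-step procedure then estimates by OLS


$$\begin{aligned}
x_1 - q_{1j} = {} & \sum_{k=1}^m \alpha_{0k}\delta_{jk} + \alpha_1 p_1 + \alpha_2 p_2 + \mathbf{w}'\boldsymbol{\gamma} + \beta\Big(y - \sum_{k=1}^m r_k\delta_{jk}\Big) \\
& + \sum_{k \neq j} (\sigma\sqrt{6}R_k/\pi)\left[\frac{\hat{p}_k \ln \hat{p}_k}{1 - \hat{p}_k} + \ln \hat{p}_j\right] + \xi,
\end{aligned}$$


where $\hat{p}_k$ are predicted probabilities from the preceding model for $p_k$, and $\xi$ is an error
with asymptotic mean zero.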
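
The added regressors in this second step are simple functions of the predicted probabilities. A minimal sketch of their construction for one household follows; the function name and the example probabilities are hypothetical, and the probabilities are assumed to lie strictly between zero and one.

```python
import math

def selection_terms(p_hat, j):
    """Return {k: p_k ln(p_k)/(1 - p_k) + ln(p_j)} for every k != j.

    p_hat : predicted choice probabilities, one per portfolio
    j     : index of the portfolio actually chosen
    """
    log_pj = math.log(p_hat[j])
    return {
        k: pk * math.log(pk) / (1.0 - pk) + log_pj
        for k, pk in enumerate(p_hat)
        if k != j
    }

# Hypothetical predicted probabilities for a household that chose portfolio 0.
terms = selection_terms([0.5, 0.3, 0.2], j=0)
print(terms)
```

Each entry of the returned dictionary would enter the OLS regression as a separate regressor, with its coefficient estimated freely.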

Dubin and McFadden estimated these models using data on 3,249 households with 
two possible appliance portfolios: electric for water and space heating and gas for 
water and space heating. 

Related examples include those of Hanemann (1984), who modeled the consump- 
tion level of a branded good where consumers consume only one of the possible 
branded goods in the choice set, and of Cameron et al. (1988), who modeled health 
service demand conditional on choice of one of a number of mutually exclusive health 
insurance policies. 

Much creativity, evident in the Dubin and McFadden example, can be required to 
specify a model that yields analytical solutions for both choice probabilities and de- 
mand conditional on choice. The advances in computational methods detailed in Chap- 
ters 12 and 13 permit estimation of such models even when analytical solutions are not 
obtained. Nonetheless, results will still be dependent on the assumed utility function 
and distribution of unobservables. 


16.8.2. Simultaneous Equations Tobit and Probit Models 


To illustrate the issues involved in extending the linear SEM approach of Section 2.4
we consider a selection model that depends on two latent variables and introduce
simultaneity into the models for the latent variables. A quite general model is

$$\begin{aligned}
y_1^* &= \alpha_1 y_2^* + \gamma_1 y_1 + \delta_1 y_2 + \mathbf{x}_1'\boldsymbol{\beta}_1 + \varepsilon_1, \\
y_2^* &= \alpha_2 y_1^* + \gamma_2 y_1 + \delta_2 y_2 + \mathbf{x}_2'\boldsymbol{\beta}_2 + \varepsilon_2,
\end{aligned} \tag{16.51}$$


where $y_1^*$ and $y_2^*$ are not completely observed but do determine the observed variables
$y_1$ and $y_2$, and the errors are assumed to be joint normally distributed. For example,
we may observe the binary indicator $y_1 = 1$ if $y_1^* > 0$ and observe $y_2 = y_2^*$ if $y_1^* > 0$.
Note that in principle either latent variables or observed outcomes or both may
appear as regressors, though identification requires restrictions such as those given in
the following.


Endogenous Latent Variables 
It is simplest to permit only the latent variables to be regressors in (16.51). Then

$$\begin{aligned}
y_1^* &= \alpha_1 y_2^* + \mathbf{x}_1'\boldsymbol{\beta}_1 + \varepsilon_1, \\
y_2^* &= \alpha_2 y_1^* + \mathbf{x}_2'\boldsymbol{\beta}_2 + \varepsilon_2.
\end{aligned} \tag{16.52}$$

The bivariate sample selection model (16.31) is an example that additionally specifies
$\alpha_2 = 0$ and directly specifies a reduced form rather than a structural form for the $y_1^*$
equation. Model (16.52) is easily estimated because the reduced form for $y_1^*$ and $y_2^*$ can
be obtained in exactly the same way as for regular linear simultaneous equations. This
reduced form can then be estimated using methods such as probit or Tobit depending
on the way that $y_1$ and $y_2$ are determined given $y_1^*$ and $y_2^*$. The parameters of the
structural model (16.52) can then be estimated by replacing the regressors $y_1^*$ and $y_2^*$
by the reduced-form predictions $\hat{y}_1^*$ and $\hat{y}_2^*$.

Models such as (16.52) are called simultaneous equations Tobit models. A simul- 
taneous equations probit model arises if the observed dependent variables yı and y2 
are binary. Estimators are presented by Nelson and Olson (1978), Amemiya (1979), 
and Lee, Maddala, and Trost (1980) and a very general treatment for a range of mod- 
els is given in L-F. Lee (1981). The standard errors of the estimators can be obtained 
using the results on sequential two-step m-estimators in Section 6.6. However, it is 
much simpler to obtain them using the bootstrap pairs procedure presented in Sec- 
tion 11.2. Identification requires exclusion restrictions in (16.51) similar to those for 
linear simultaneous equations. 


Endogenous Regressors 


A common specialization of the model (16.52) is to a Tobit model with endogenous
regressor that is completely observed. Then $y_2^*$ is fully observed, so $y_2 = y_2^*$, whereas
we observe $y_1 = y_1^*$ if $y_1^* > 0$ and $y_1 = 0$ otherwise. The model becomes


$$\begin{aligned}
y_1^* &= \alpha_1 y_2 + \mathbf{x}_1'\boldsymbol{\beta}_1 + \varepsilon_1, \\
y_2 &= \mathbf{x}'\boldsymbol{\pi} + v,
\end{aligned} \tag{16.53}$$


where the first equation is the structural equation of interest and the second equation
is the reduced form for the endogenous regressor $y_2$. Again note that here $y_2$ is
continuous, not discrete. For joint normal errors $\varepsilon_1 = \gamma v + \xi$, where $\xi$ is an independent
normal error (see Section 5.1), so $y_1^* = \alpha_1 y_2 + \mathbf{x}_1'\boldsymbol{\beta}_1 + \gamma v + \xi$.
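
The decomposition of $\varepsilon_1$ used here is the usual linear projection under joint normality. As a sketch, with $\sigma_1^2$, $\sigma_v^2$, and $\sigma_{1v}$ (notation introduced here) denoting $V[\varepsilon_1]$, $V[v]$, and $\text{Cov}[\varepsilon_1, v]$:

$$\begin{aligned}
E[\varepsilon_1 \mid v] &= (\sigma_{1v}/\sigma_v^2)\,v, \quad \text{so } \gamma = \sigma_{1v}/\sigma_v^2, \\
\xi &= \varepsilon_1 - \gamma v, \qquad \text{Cov}[\xi, v] = \sigma_{1v} - \gamma\sigma_v^2 = 0, \\
V[\xi] &= \sigma_1^2 - \sigma_{1v}^2/\sigma_v^2 .
\end{aligned}$$

Because $\xi$ and $v$ are jointly normal and uncorrelated they are independent, and $\gamma = 0$ exactly when $\varepsilon_1$ and $v$ are uncorrelated, that is, when $y_2$ is exogenous in the first equation.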

A two-step estimation procedure calculates predicted residuals $\hat{v} = y_2 - \mathbf{x}'\hat{\boldsymbol{\pi}}$ from
OLS regression of $y_2$ on $\mathbf{x}$ and then obtains Tobit estimates from the model


$$y_1^* = \alpha_1 y_2 + \mathbf{x}_1'\boldsymbol{\beta}_1 + \gamma\hat{v} + e_1,$$


where the error $e_1$ is normally distributed. A test for endogeneity of $y_2$ can be
implemented as a Wald test of $\gamma = 0$ using the usual standard errors from a Tobit package.
This test is an extension of the auxiliary regression to implement the Hausman
endogeneity test in the linear model (see Section 8.4.3). If the null hypothesis is rejected
then the aforementioned second-step Tobit regression yields consistent estimates of $\alpha_1$
and $\boldsymbol{\beta}_1$, but standard errors then need to be adjusted because of first-step estimation
of the additional regressor $\hat{v}$. See Smith and Blundell (1986) for details for the Tobit
model and Rivers and Vuong (1988) for a similar procedure that estimates a probit
model at the second step.


Endogenous Censored or Binary Variables


Analysis is more complicated if the observed censored or binary endogenous variables
$y_1$ or $y_2$ appear as regressors in (16.51). Heckman (1978) considered the following
model:


$$\begin{aligned}
y_1^* &= \gamma_1 y_1 + \delta_1 y_2 + \mathbf{x}_1'\boldsymbol{\beta}_1 + \varepsilon_1, \\
y_2^* &= \alpha_2 y_1^* + \gamma_2 y_1 + \mathbf{x}_2'\boldsymbol{\beta}_2 + \varepsilon_2,
\end{aligned} \tag{16.54}$$


where we observe $y_1 = 1$ if $y_1^* > 0$ and $y_1 = 0$ if $y_1^* \leq 0$, and we observe $y_2 = y_2^*$ all
the time. The complication here is that $y_1$ appears as a regressor. A meaningful reduced
form for $y_1^*$ can depend only on $\mathbf{x}_1$ and $\mathbf{x}_2$ and not $y_1$. This imposes the restriction that
$\delta_1\gamma_2 + \gamma_1 = 0$, an example of what is called a coherency condition in this literature.
Then the reduced form of the model becomes


$$\begin{aligned}
y_1^* &= \mathbf{x}'\boldsymbol{\pi}_1 + v_1, \\
y_2 &= \gamma_2 y_1 + \mathbf{x}'\boldsymbol{\pi}_2 + v_2.
\end{aligned}$$


This is a special case of the Roy model where participation ($y_1 = 1$) leads to only an
intercept shift (via $\gamma_2$) in the outcome. In general, models with regressors that include
censored or truncated endogenous variables are difficult to estimate. See, for example,
Blundell and Smith (1989).
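
The coherency condition can be seen by substituting the second equation of (16.54), with $y_2 = y_2^*$, into the first. A sketch of the algebra:

$$\begin{aligned}
y_1^* &= \gamma_1 y_1 + \delta_1(\alpha_2 y_1^* + \gamma_2 y_1 + \mathbf{x}_2'\boldsymbol{\beta}_2 + \varepsilon_2) + \mathbf{x}_1'\boldsymbol{\beta}_1 + \varepsilon_1, \\
(1 - \delta_1\alpha_2)\,y_1^* &= (\gamma_1 + \delta_1\gamma_2)\,y_1 + \mathbf{x}_1'\boldsymbol{\beta}_1 + \delta_1\mathbf{x}_2'\boldsymbol{\beta}_2 + \varepsilon_1 + \delta_1\varepsilon_2 .
\end{aligned}$$

Since $y_1$ is itself determined by the sign of $y_1^*$, the right-hand side is free of $y_1$ only if $\gamma_1 + \delta_1\gamma_2 = 0$. Assuming additionally that $\delta_1\alpha_2 \neq 1$, dividing through by $1 - \delta_1\alpha_2$ gives a reduced form in the exogenous regressors alone.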


Example 


Brooks, Cameron, and Carter (1998) applied a simultaneous equations Tobit model 
to explain the vote by congressional representatives on a pro-sugar amendment. The 
three observed outcomes y1, y2, and y3 were, respectively, the vote (yes or no) and 
contributions to their campaign funds from sugar interests and (opposing) sweetener- 
user interests. The first outcome is a binary outcome and the other two outcomes are 
censored at zero. A simultaneous equations model for the associated latent variables
$y_1^*$, $y_2^*$, and $y_3^*$ was specified, so the structural model is of the simpler form (16.52).

How reasonable is this specification? Here campaign contributions $y_2$ and $y_3$ should
depend on the latent variable $y_1^*$ since the actual vote $y_1$ was made at a later date.
For $y_1$, however, an alternative and more difficult model is that $y_1^*$, the latent variable
for the vote, depends on actual contributions received ($y_2$ and $y_3$) rather than on the
latent contributions. However, if this is viewed as a game likely to be repeated in
the future, a case can be made for using $y_2^*$ and $y_3^*$. Clearly, the reasonableness of
such assumptions will vary with the application. Parameter identification was secured 
by exclusion restrictions on the exogenous regressors. Consistent estimation relies on 
errors being joint normally distributed. 


16.9. Semiparametric Estimation 


Censoring, truncation, and sample selection lead to a sample that differs from the
population. This is essentially a missing data problem, one that is complicated because
data are missing on the dependent variable(s) rather than on exogenous regressors.
The preceding methods solved this missing data problem by making distributional
assumptions to obtain either a likelihood function for the sample data or an appropriate
censored, truncated, or selected conditional mean.

These methods are fragile to even very minor misspecification of error distributions. 
For example, both the MLE and the Heckman two-step estimator in the standard Tobit 
model are inconsistent if errors are normal but heteroskedastic, or if they are homo- 
skedastic but nonnormal. See, for example, Paarsch (1982) and the references therein. 

Considerable efforts have been devoted to developing semiparametric estimators 
that are consistent under weaker distributional assumptions. Before presenting leading 
examples, however, we note that an alternative is to continue to take a fully parametric 
approach that is based on richer, more flexible distributional assumptions. 


16.9.1. Flexible Parametric Models 


For simplicity begin with the classical Tobit model $y_i^* = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i$. The assumption
that $\varepsilon_i \sim N[0, \sigma^2]$ can be relaxed in two ways. First, heteroskedasticity can be
incorporated through an explicit model $\sigma_i^2 = \exp(\mathbf{z}_i'\boldsymbol{\gamma})$, where now both $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$ need to be
estimated. Second, more flexible distributions than the normal distribution might be
used. For example, one might use a squared polynomial expansion of the normal (see
Section 9.7.7).
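
To make the first extension concrete, the following is a minimal sketch of the log-likelihood for a Tobit model left-censored at zero with $\sigma_i^2 = \exp(\mathbf{z}_i'\boldsymbol{\gamma})$. The function name and the small data set are hypothetical. The sketch only evaluates the log-likelihood at given parameter values; maximizing it over $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$ would require a numerical optimizer, which is not shown.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def tobit_loglik(beta, gamma, y, x, z):
    """Log-likelihood for y_i = max(x_i'beta + e_i, 0), e_i ~ N[0, exp(z_i'gamma)]."""
    total = 0.0
    for yi, xi, zi in zip(y, x, z):
        mean = sum(b * v for b, v in zip(beta, xi))
        sigma = math.exp(0.5 * sum(g * v for g, v in zip(gamma, zi)))
        if yi > 0:
            # Uncensored observation: log of the normal density of y_i.
            total += math.log(_STD_NORMAL.pdf((yi - mean) / sigma)) - math.log(sigma)
        else:
            # Censored observation: log of Pr[y_i* <= 0] = Phi(-mean / sigma).
            total += math.log(_STD_NORMAL.cdf(-mean / sigma))
    return total

# Hypothetical data: a constant and one regressor, which also drives the variance.
y = [0.0, 1.5, 0.0, 2.2]
x = [[1.0, 0.1], [1.0, 0.9], [1.0, 0.3], [1.0, 1.4]]
z = x
value = tobit_loglik([0.2, 1.0], [0.0, 0.5], y, x, z)
print(value)
```

Setting every element of `gamma` other than the constant's to zero recovers the homoskedastic Tobit log-likelihood, so the two models are nested.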

For the bivariate sample selection model a similar approach may be taken, where
now a more flexible joint distribution for $(\varepsilon_1, \varepsilon_2)$ is used. Lee (1983) proposed working
with transformations $(\varepsilon_1^*, \varepsilon_2^*)$ of $(\varepsilon_1, \varepsilon_2)$ for which the bivariate normality assumption
may be more reasonable.

Bayesian methods can also be applied to such models. Chib (1992) considered the 
censored Tobit model. The latent variables y* are introduced as auxiliary variables and 
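the data augmentation approach (see Section 13.7) is used. The Gibbs sampler cycles
among (1) the conditional posterior for $\boldsymbol{\beta} \mid \mathbf{y}, \mathbf{y}^*, \sigma^2$, (2) the conditional posterior for
$\sigma^2 \mid \mathbf{y}, \mathbf{y}^*, \boldsymbol{\beta}$, and (3) the posterior for $\mathbf{y}^* \mid \mathbf{y}, \boldsymbol{\beta}, \sigma^2$.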
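
Step (3) is the data augmentation step. As a rough sketch only, and not a full Gibbs sampler, the function below draws the latent value for one observation in a Tobit model left-censored at zero: an uncensored observation is returned unchanged, and a censored one is drawn from a normal distribution truncated to be nonpositive, using the inverse-cdf method. The function name and the numerical values are hypothetical.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def draw_latent(y, mean, sigma, rng):
    """Draw y* given observed y, with y* ~ N[mean, sigma^2] and y = max(y*, 0)."""
    if y > 0:
        return y                                    # y* is observed directly
    upper = _STD_NORMAL.cdf(-mean / sigma)          # Pr[y* <= 0]
    u = (1.0 - rng.random()) * upper                # uniform on (0, upper]
    u = min(max(u, 1e-12), 1.0 - 1e-12)             # keep inv_cdf well defined
    return mean + sigma * _STD_NORMAL.inv_cdf(u)

rng = random.Random(2024)
draws = [draw_latent(0.0, 0.5, 1.0, rng) for _ in range(1000)]
print(max(draws))
```

In a complete sampler, `mean` would be $\mathbf{x}_i'\boldsymbol{\beta}$ and `sigma` the current draw of $\sigma$, and this step would alternate with draws of $\boldsymbol{\beta}$ and $\sigma^2$ from their conditional posteriors.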

A flexible parametric approach is particularly advantageous for handling censor- 
ing, truncation, and sample selection in nonlinear models such as those for counts and 
for duration data or mixed types of data, as semiparametric methods are less likely to 
be available then. 


16.9.2. Semiparametric Estimation for Censored Models 


We now move on to semiparametric estimation. We consider a linear model for the
latent variable $y_i^* = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i$, which is left-censored at zero so that we observe $y_i = y_i^*$
if $y_i^* > 0$ and $y_i = 0$ if $y_i^* \leq 0$. The semiparametric literature usually expresses the
model as


$$y_i = \max(\mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i, 0). \tag{16.55}$$


This is the Tobit model (16.11)-(16.13), except the distribution of $\varepsilon$ is unspecified.
With some adaptation this model also covers left-censoring at a known fixed point
other than zero and right-censoring such as for top-coded data. For example, if
$y = \min(\mathbf{x}'\boldsymbol{\beta} + \varepsilon, U)$ then $U - y = \max(U - \mathbf{x}'\boldsymbol{\beta} - \varepsilon, 0)$. The goal is to consistently
estimate $\boldsymbol{\beta}$ without specifying a complete parametric distribution for $\varepsilon_i$. The estimators
are called semiparametric as the uncensored mean $\mathbf{x}_i'\boldsymbol{\beta}$ is parameterized but the error
distribution is not. The methods presented in the following differ in the assumptions
made on the distribution of $\varepsilon$.

From (16.8) ML estimation is possible given knowledge of the cdf of $y^*$ and hence
of $\varepsilon$. The cdf of $\varepsilon$ can be nonparametrically estimated using the Kaplan-Meier product
limit estimator for the cdf presented in Chapter 17 for the case of right-censored
duration data. Alternatively, the distribution of $\varepsilon$ can be nonparametrically determined
using the series expansion of Gallant and Nychka (1987); see Section 9.7.7. These
semiparametric ML estimation methods are rarely implemented.

Instead, the literature focuses on estimation based on conditional moments. From
(16.20) the conditional censored mean $E[y \mid \mathbf{x}]$ is clearly a single-index model with
$E[y \mid \mathbf{x}] = g(\mathbf{x}'\boldsymbol{\beta})$, where the function $g(\cdot)$ is unknown if the distribution of $\varepsilon$ is not
specified. The single-index methods of Section 9.7.4 can therefore be applied, though
as noted there $\boldsymbol{\beta}$ can be estimated only up to location and scale.

A more popular approach considers alternative conditional censored moments that
are less altered by censoring. Powell (1984) proposed using the conditional median.
The key distributional assumption is that $\varepsilon \mid \mathbf{x}$ has median zero, in which case the
conditional median of the latent variable $y^* \mid \mathbf{x}$ equals $\mathbf{x}'\boldsymbol{\beta}$ and, because the median is
preserved under the monotone transformation $\max(\cdot, 0)$, the conditional median of $y \mid \mathbf{x}$
equals $\max(\mathbf{x}'\boldsymbol{\beta}, 0)$. The intuition for Powell's estimator is most easily obtained by
supposing $y$ is iid. If less than half the sample is censored, so that less than half of the
observations are zero and more than half are positive, then the censored sample median
provides a consistent estimate of the population median. Powell (1984) extended this
idea to the regression case, where the same logic follows for those observations for
which less than half the observations on $\varepsilon \mid \mathbf{x}$ are censored, where $\varepsilon = y - \mathbf{x}'\boldsymbol{\beta}$ depends
on $\boldsymbol{\beta}$, which needs to be estimated. The regression analogue of median estimation is
LAD estimation (see Section 4.6). This leads to the censored least absolute deviations
(CLAD) estimator $\hat{\boldsymbol{\beta}}_{\text{CLAD}}$, which minimizes


$$Q_N(\boldsymbol{\beta}) = N^{-1}\sum_{i=1}^N |y_i - \max(\mathbf{x}_i'\boldsymbol{\beta}, 0)|. \tag{16.56}$$


The essential assumption for consistency of this estimator is that $\varepsilon \mid \mathbf{x}$ has median zero.
Given this assumption the estimator is consistent even if errors are conditionally
heteroskedastic. The estimator for $\boldsymbol{\beta}$ is $\sqrt{N}$-consistent and asymptotically normal. More
efficient estimators can be obtained by weighting the terms in sums by $f(0 \mid \mathbf{x}_i)$, the
conditional density of $\varepsilon_i \mid \mathbf{x}_i$ evaluated at zero. The method can also be extended to
conditional quantiles.
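
The iid intuition for CLAD can be checked in a small simulation. The sketch below codes the objective function (16.56) and, for an intercept-only model with simulated data, compares the median and mean of the censored sample with those of the latent variable. The function name, seed, sample size, and parameter values are all hypothetical choices made for illustration.

```python
import random
import statistics

def clad_objective(beta, y, x):
    """Q_N(beta) = N^{-1} sum_i |y_i - max(x_i'beta, 0)|, as in (16.56)."""
    total = 0.0
    for yi, xi in zip(y, x):
        index = sum(b * v for b, v in zip(beta, xi))
        total += abs(yi - max(index, 0.0))
    return total / len(y)

# Intercept-only illustration with simulated data.
rng = random.Random(12345)
y_star = [1.0 + rng.gauss(0.0, 1.0) for _ in range(1001)]   # latent variable
y = [max(v, 0.0) for v in y_star]                            # left-censored at zero
x = [[1.0] for _ in y]

median_latent = statistics.median(y_star)
median_censored = statistics.median(y)   # unaffected when under half are censored
mean_latent = statistics.mean(y_star)
mean_censored = statistics.mean(y)       # pushed upward by censoring

q_at_median = clad_objective([median_censored], y, x)
q_nearby = clad_objective([median_censored + 0.3], y, x)
print(median_latent, median_censored, mean_latent, mean_censored)
```

With regressors the objective is not smooth in $\boldsymbol{\beta}$, so in practice it is minimized by specialized algorithms rather than by the direct comparison used here.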

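An alternative procedure uses a symmetrically trimmed mean, rather than the median,
that is also unaffected by censoring. Assume that $\varepsilon \mid \mathbf{x}$ is symmetrically distributed.
This implies that for observations with positive mean (i.e., $\mathbf{x}'\boldsymbol{\beta} > 0$) $y \mid \mathbf{x}$ is
symmetrically distributed on the interval $(0, 2\mathbf{x}'\boldsymbol{\beta})$. Then either $\mathbf{x}'\boldsymbol{\beta} + \varepsilon < 0$ and $y = 0$ is
observed or, with equal probability, $\mathbf{x}'\boldsymbol{\beta} + \varepsilon > 2\mathbf{x}'\boldsymbol{\beta}$ and the data
are artificially set to $2\mathbf{x}'\boldsymbol{\beta}$ to preserve the symmetry about $\mathbf{x}'\boldsymbol{\beta}$. We have shown that


$$E[\mathbf{1}(\mathbf{x}'\boldsymbol{\beta} > 0)\,[\min(y, 2\mathbf{x}'\boldsymbol{\beta}) - \mathbf{x}'\boldsymbol{\beta}]\,\mathbf{x}] = \mathbf{0}, \tag{16.57}$$

where $\mathbf{1}(\mathbf{x}'\boldsymbol{\beta} > 0)$ restricts attention to observations with positive mean, and the new
dependent variable is $y = 0$, or $0 < y < 2\mathbf{x}'\boldsymbol{\beta}$, or $2\mathbf{x}'\boldsymbol{\beta}$ if $y > 2\mathbf{x}'\boldsymbol{\beta}$. The moment
estimator based on (16.57) does not have a unique solution for $\boldsymbol{\beta}$. Powell (1986b) proposed
the symmetrically censored least squares (SCLS) estimator that minimizes


$$Q_N(\boldsymbol{\beta}) = N^{-1}\sum_{i=1}^N \left\{[y_i - \max(y_i/2, \mathbf{x}_i'\boldsymbol{\beta})]^2 + \mathbf{1}(y_i > 2\mathbf{x}_i'\boldsymbol{\beta})\,[y_i^2/4 - (\max(0, \mathbf{x}_i'\boldsymbol{\beta}))^2]\right\}, \tag{16.58}$$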


which with some algebra can be shown to yield first-order conditions that are the 
sample analogue of moment condition (16.57). Chay and Honoré (1998) provide a 
graphical exposition of the trimming for the SCLS estimator, as well as for the related 
pairwise difference estimators of Honoré and Powell (1994). 
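
A minimal sketch of the SCLS objective (16.58) follows, together with a grid search for an intercept-only model on simulated data with symmetric errors. The function name, seed, grid, and parameter values are hypothetical, and a grid search is used only because it keeps the example short; it is not how the estimator would be computed with several regressors.

```python
import random

def scls_objective(beta, y, x):
    """Symmetrically censored least squares objective, as in (16.58)."""
    total = 0.0
    for yi, xi in zip(y, x):
        index = sum(b * v for b, v in zip(beta, xi))
        total += (yi - max(yi / 2.0, index)) ** 2
        if yi > 2.0 * index:
            total += yi ** 2 / 4.0 - max(0.0, index) ** 2
    return total / len(y)

# Simulated data: latent mean 1, standard normal errors, censoring at zero.
rng = random.Random(2005)
y = [max(1.0 + rng.gauss(0.0, 1.0), 0.0) for _ in range(2000)]
x = [[1.0] for _ in y]

# Grid search over the intercept; fine enough for illustration only.
grid = [0.01 * k for k in range(0, 301)]
b_hat = min(grid, key=lambda b: scls_objective([b], y, x))
print(b_hat)
```

For a positive intercept $b$ the derivative of this objective is proportional to $-\sum_i [\min(y_i, 2b) - b]$, which is the sample analogue of (16.57) in the intercept-only case.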

Melenberg and Van Soest (1996), Chay and Honoré (1998), and Chay and Pow- 
ell (2001) provide applications of some of these estimators. Pagan and Ullah (1999) 
provide additional methods and theory. 

As an empirical example we applied CLAD estimation to the Section 16.2.1 data 
that were generated from a Tobit model with normal errors. The slope parameter (set 
to 1000) was estimated to be 956 (standard error 117) using ML compared to 838 
(standard error 165) using CLAD. As expected, the robustness of CLAD to nonnormality
comes at the expense of some loss in efficiency.


16.9.3. Semiparametric Estimation for Selection Models 


Semiparametric estimation of sample selection models is more challenging. We consider
the most commonly studied model, the bivariate sample selection model defined
in Section 16.5.2, where now we relax the assumption that the errors $(\varepsilon_1, \varepsilon_2)$ are
joint normally distributed.

Semiparametric ML estimation is possible. In particular Gallant and Nychka (1987) 
explicitly considered the bivariate sample selection model as a suitable candidate for 
their series expansion estimator presented in Section 9.7.7. 

The literature instead uses as starting point the expression for the truncated conditional
mean, which from (16.34) is given by


$$\begin{aligned}
E[y_{2i} \mid \mathbf{x}_i, y_{1i}^* > 0] &= \mathbf{x}_{2i}'\boldsymbol{\beta}_2 + E[\varepsilon_{2i} \mid \varepsilon_{1i} > -\mathbf{x}_{1i}'\boldsymbol{\beta}_1] \\
&= \mathbf{x}_{2i}'\boldsymbol{\beta}_2 + g(\mathbf{x}_{1i}'\boldsymbol{\beta}_1),
\end{aligned} \tag{16.59}$$


where the second equality assumes that $\varepsilon_{2i} \mid \mathbf{x}_i, \varepsilon_{1i}$ has distribution that depends on just
$\mathbf{x}_{1i}'\boldsymbol{\beta}_1$, similar to assumption (16.41). The distribution of $(\varepsilon_1, \varepsilon_2)$ is left unspecified so the
function $g(\cdot)$ is unknown, leading to a semiparametric estimation problem. Since it
is possible that $g(\mathbf{x}_1'\boldsymbol{\beta}_1) = \mathbf{x}_1'\boldsymbol{\beta}_1$, identification in this model with $g(\cdot)$ unspecified
requires an exclusion restriction that at least one component of $\mathbf{x}_1$ does not appear in $\mathbf{x}_2$.
Moreover, the more uncorrelated $\mathbf{x}_1'\boldsymbol{\beta}_1$ is with $\mathbf{x}_2$ the better $\boldsymbol{\beta}_2$ and $g(\cdot)$ can be
distinguished. The model (16.59) is a partially linear model, which can be estimated using
methods presented in Section 9.7.3. Popular methods include the Robinson (1988a)
differencing estimator and using a series expansion for $g(\mathbf{x}_1'\boldsymbol{\beta}_1)$. Since $\boldsymbol{\beta}_1$ is unknown
the regression is of $y_{2i}$ on $\mathbf{x}_{2i}'\boldsymbol{\beta}_2 + g(\mathbf{x}_{1i}'\hat{\boldsymbol{\beta}}_1)$, where $\hat{\boldsymbol{\beta}}_1$ can be obtained by regression
of the binary outcome $y_{1i}$ on $\mathbf{x}_{1i}$, using one of the semiparametric binary model
estimators given in Section 14.7. These methods provide consistent estimates of the slope
parameters $\boldsymbol{\beta}_2$. To additionally estimate the intercept, necessary for analysis of the
levels rather than changes in $y_2$, see Andrews and Schafgans (1998).
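
The differencing idea can be sketched as follows, writing $v_i = \mathbf{x}_{1i}'\boldsymbol{\beta}_1$ for the selection index and $u_i$ for the error in (16.59) (both labels are introduced here), with all expectations taken over the selected sample:

$$\begin{aligned}
y_{2i} &= \mathbf{x}_{2i}'\boldsymbol{\beta}_2 + g(v_i) + u_i, \qquad E[u_i \mid \mathbf{x}_i] = 0, \\
E[y_{2i} \mid v_i] &= E[\mathbf{x}_{2i} \mid v_i]'\boldsymbol{\beta}_2 + g(v_i), \\
y_{2i} - E[y_{2i} \mid v_i] &= (\mathbf{x}_{2i} - E[\mathbf{x}_{2i} \mid v_i])'\boldsymbol{\beta}_2 + u_i .
\end{aligned}$$

Replacing the two conditional expectations by nonparametric regressions on $\hat{v}_i = \mathbf{x}_{1i}'\hat{\boldsymbol{\beta}}_1$ and applying OLS to the differenced equation gives an estimate of $\boldsymbol{\beta}_2$. Any constant in $\mathbf{x}_{2i}$ is differenced out along with $g(\cdot)$, which is why the intercept needs separate treatment.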

Newey, Powell, and Walker (1990) applied this approach to female labor supply. 
The participation indicator model was estimated using several different methods and 
the equation for the outcome yz was estimated using the method of Robinson (1988a). 
Melenberg and Van Soest (1996) modeled vacation expenditures using a wide range of 
semiparametric methods for both the bivariate sample selection and censored regres- 
sion models. A richer model is provided by Das, Newey and Vella (2003). 

Manski (1989) considered identification in the bivariate sample selection model 
under relatively minimal assumptions and provided bounds for the mean and for 
marginal effects, conditional on both regressors and selection. 
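
To give a sense of what such bounds look like, the sketch below computes the simplest worst-case bounds on a population mean when the outcome is observed only for selected units. It rests on an assumption that is not made in the text above, namely that the outcome is known to lie in a bounded interval, and it is meant as an illustration of the idea rather than a statement of the specific bounds derived by Manski. The function name and numbers are hypothetical.

```python
def worst_case_mean_bounds(mean_selected, prob_selected, y_min, y_max):
    """Bounds on E[y] when y is seen only if selected and y lies in [y_min, y_max].

    E[y] = E[y | selected] Pr[selected] + E[y | not selected] Pr[not selected],
    and the unobserved E[y | not selected] can be anywhere in [y_min, y_max].
    """
    prob_missing = 1.0 - prob_selected
    lower = mean_selected * prob_selected + y_min * prob_missing
    upper = mean_selected * prob_selected + y_max * prob_missing
    return lower, upper

# Hypothetical example: a binary outcome observed for 80% of the population.
lower, upper = worst_case_mean_bounds(mean_selected=0.6, prob_selected=0.8,
                                      y_min=0.0, y_max=1.0)
print(lower, upper)
```

The width of the interval equals the range of the outcome times the probability of not being selected, so the bounds are informative only when selection is not too severe.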


16.10. Derivations for the Tobit Model 
16.10.1. Truncated Moments of Standard Normal 


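Consider $z \sim N[0, 1]$, with density $\phi(z) = (1/\sqrt{2\pi})\exp(-z^2/2)$ and cdf $\Phi(z)$. Since
$\Pr[z > c] = 1 - \Phi(c)$, the conditional density of $z \mid z > c$ is $\phi(z)/(1 - \Phi(c))$. It
follows that


$$\begin{aligned}
E[z \mid z > c] &= \int_c^\infty z\,\{\phi(z)/[1 - \Phi(c)]\}\,dz \\
&= \int_c^\infty z\,(1/\sqrt{2\pi})\exp(-z^2/2)\,dz \Big/ [1 - \Phi(c)] \\
&= \int_c^\infty \frac{\partial}{\partial z}\left\{-(1/\sqrt{2\pi})\exp(-z^2/2)\right\}dz \Big/ [1 - \Phi(c)] \\
&= \left[-(1/\sqrt{2\pi})\exp(-z^2/2)\right]_c^\infty \Big/ [1 - \Phi(c)] \\
&= \phi(c)/[1 - \Phi(c)].
\end{aligned}$$


Similarly,


$$\begin{aligned}
E[z^2 \mid z > c] &= \int_c^\infty z^2\,\{\phi(z)/[1 - \Phi(c)]\}\,dz \\
&= \int_c^\infty z \times z \times (1/\sqrt{2\pi})\exp(-z^2/2)\,dz \Big/ [1 - \Phi(c)] \\
&= \int_c^\infty z \times \frac{\partial}{\partial z}\left\{-(1/\sqrt{2\pi})\exp(-z^2/2)\right\}dz \Big/ [1 - \Phi(c)] \\
&= \left[z \times \{-(1/\sqrt{2\pi})\exp(-z^2/2)\}\right]_c^\infty \Big/ [1 - \Phi(c)] \\
&\quad - \int_c^\infty 1 \times \{-(1/\sqrt{2\pi})\exp(-z^2/2)\}\,dz \Big/ [1 - \Phi(c)] \\
&= \{c\phi(c) + [1 - \Phi(c)]\}/[1 - \Phi(c)] \\
&= c\phi(c)/[1 - \Phi(c)] + 1.
\end{aligned}$$

It follows after a little algebra that


$$\begin{aligned}
V[z \mid z > c] &= E[z^2 \mid z > c] - (E[z \mid z > c])^2 \\
&= 1 + c\phi(c)/[1 - \Phi(c)] - \phi(c)^2/[1 - \Phi(c)]^2.
\end{aligned}$$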
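
These closed-form expressions are easy to check numerically. The sketch below evaluates them and compares the first two moments with a direct numerical integration of the truncated density. The function names, the integration limit of 12, and the number of steps are arbitrary illustrative choices.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def truncated_moments(c):
    """Closed-form E[z|z>c], E[z^2|z>c], and V[z|z>c] for z ~ N[0, 1]."""
    lam = _STD_NORMAL.pdf(c) / (1.0 - _STD_NORMAL.cdf(c))
    return lam, 1.0 + c * lam, 1.0 + c * lam - lam ** 2

def numeric_truncated_moment(c, power, upper=12.0, steps=20000):
    """Midpoint-rule approximation to E[z^power | z > c]."""
    h = (upper - c) / steps
    total = 0.0
    for i in range(steps):
        z = c + (i + 0.5) * h
        total += z ** power * _STD_NORMAL.pdf(z)
    return total * h / (1.0 - _STD_NORMAL.cdf(c))

mean, second, variance = truncated_moments(0.5)
print(mean, numeric_truncated_moment(0.5, 1))
print(second, numeric_truncated_moment(0.5, 2))
```

At $c = 0$ the formulas reduce to the half-normal values $E[z \mid z > 0] = \sqrt{2/\pi}$ and $V[z \mid z > 0] = 1 - 2/\pi$.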


16.10.2. Asymptotic Theory for Heckman’s Two-Step Estimator 
in the Tobit Model 


The asymptotic variance matrix of the two-step Heckman estimator is complicated by 
its dependence on first-step parameter estimates. There are several ways to obtain the 
asymptotic variance, such as that in Amemiya (1985, pp. 369-370). Here we instead 
apply the general result for sequential two-step m-estimators given in Section 6.6. 
We consider the simplest case of the Tobit model (see Section 16.3.6). The methods 
can be adapted to two-step estimators for the bivariate sample selection model (Sec- 
tion 16.5.4) and simultaneous equations Tobit model (Section 16.8.2). A much simpler 
quite different approach is to use the bootstrap pairs procedure (see Section 11.2). 

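From (16.26) we wish to estimate the parameters $\boldsymbol{\gamma} = [\boldsymbol{\beta}'\ \sigma]'$ in the equation for
positive $y_i$:


$$\begin{aligned}
y_i &= \mathbf{x}_i'\boldsymbol{\beta} + \sigma\lambda(\mathbf{x}_i'\boldsymbol{\alpha}) + \eta_i \\
&= \mathbf{w}_i(\boldsymbol{\alpha})'\boldsymbol{\gamma} + \eta_i,
\end{aligned}$$


where $\mathbf{w}_i(\boldsymbol{\alpha}) = [\mathbf{x}_i'\ \lambda_i(\mathbf{x}_i'\boldsymbol{\alpha})]'$ and $\eta_i = y_i - \mathbf{x}_i'\boldsymbol{\beta} - \sigma\lambda(\mathbf{x}_i'\boldsymbol{\alpha})$ is heteroskedastic with
variance $\sigma_{\eta_i}^2$ defined in (16.24). The first step of the two-step procedure is to obtain
an estimate $\hat{\boldsymbol{\alpha}}$ of the unknown parameter $\boldsymbol{\alpha}$ by probit MLE. It follows that the normal
equations for the two parts of the Heckman two-step estimator are


$$\begin{aligned}
\sum_{i=1}^N \frac{\phi(\mathbf{x}_i'\boldsymbol{\alpha})\,[d_i - \Phi(\mathbf{x}_i'\boldsymbol{\alpha})]}{\Phi(\mathbf{x}_i'\boldsymbol{\alpha})\,[1 - \Phi(\mathbf{x}_i'\boldsymbol{\alpha})]}\,\mathbf{x}_i &= \mathbf{0}, \\
\sum_{i=1}^N d_i\mathbf{w}_i(\boldsymbol{\alpha})\,(y_i - \mathbf{w}_i(\boldsymbol{\alpha})'\boldsymbol{\gamma}) &= \mathbf{0},
\end{aligned} \tag{16.60}$$


where the first equation gives the probit first-order conditions for $\boldsymbol{\alpha}$, and the second
equation gives first-order conditions for $\boldsymbol{\gamma}$ for OLS on positive $y_i$ ($d_i = 1$).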

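These equations can be combined as $\sum_{i=1}^N \mathbf{h}(\mathbf{x}_i, \boldsymbol{\theta}) = \mathbf{0}$ where $\boldsymbol{\theta} = (\boldsymbol{\alpha}', \boldsymbol{\gamma}')'$. By
the usual first-order Taylor series expansion $\sqrt{N}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}) \stackrel{d}{\rightarrow} N[\mathbf{0}, \mathbf{G}_0^{-1}\mathbf{S}_0(\mathbf{G}_0^{-1})']$ where
$\mathbf{G}_0 = \lim N^{-1}E[\sum_{i=1}^N \partial\mathbf{h}(\mathbf{x}_i, \boldsymbol{\theta})/\partial\boldsymbol{\theta}']$ and $\mathbf{S}_0 = \lim N^{-1}E[\sum_{i=1}^N \mathbf{h}(\mathbf{x}_i, \boldsymbol{\theta})\mathbf{h}(\mathbf{x}_i, \boldsymbol{\theta})']$. We
are interested in the subcomponent corresponding to $\boldsymbol{\gamma}$. Simplification occurs because
$\partial\mathbf{h}(\mathbf{x}_i, \boldsymbol{\theta})/\partial\boldsymbol{\theta}'$ is block triangular because $\boldsymbol{\gamma}$ does not appear in the first set of equations.
Partitioning yields the general result


$$V[\hat{\boldsymbol{\theta}}_2] = \mathbf{G}_{22}^{-1}\left\{\mathbf{S}_{22} + \mathbf{G}_{21}[\mathbf{G}_{11}^{-1}\mathbf{S}_{11}\mathbf{G}_{11}^{-1\prime}]\mathbf{G}_{21}' - \mathbf{G}_{21}\mathbf{G}_{11}^{-1}\mathbf{S}_{12} - \mathbf{S}_{21}\mathbf{G}_{11}^{-1\prime}\mathbf{G}_{21}'\right\}\mathbf{G}_{22}^{-1\prime},$$


where the matrices are defined in Section 6.6.

Specializing to the problem here, we first consider the terms in $\mathbf{G}_0$. Then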


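$$\begin{aligned}
\mathbf{G}_{11} &= -\lim \frac{1}{N}\sum_{i=1}^N \frac{\phi^2(\mathbf{x}_i'\boldsymbol{\alpha})}{\Phi(\mathbf{x}_i'\boldsymbol{\alpha})\,[1 - \Phi(\mathbf{x}_i'\boldsymbol{\alpha})]}\,\mathbf{x}_i\mathbf{x}_i', \\
\mathbf{G}_{21} &= -\lim \frac{1}{N}\sum_{i=1}^N E\left[d_i\mathbf{w}_i\,\sigma\,\frac{\partial\lambda(\mathbf{x}_i'\boldsymbol{\alpha})}{\partial\boldsymbol{\alpha}'}\right], \\
\mathbf{G}_{22} &= -\lim \frac{1}{N}\sum_{i=1}^N E[d_i\mathbf{w}_i\mathbf{w}_i'].
\end{aligned}$$


The expression for $\mathbf{G}_{11}$ uses knowledge that $-\mathbf{G}_{11}^{-1}$ is just the variance of the probit MLE.
The expression for $\mathbf{G}_{21}$ uses


$$\begin{aligned}
E\left[\frac{\partial\mathbf{h}_{2i}}{\partial\boldsymbol{\alpha}'}\right] &= E\left[\frac{\partial\{d_i\mathbf{w}_i(\boldsymbol{\alpha})(y_i - \mathbf{w}_i(\boldsymbol{\alpha})'\boldsymbol{\gamma})\}}{\partial\boldsymbol{\alpha}'}\right] \\
&= -E\left[d_i\mathbf{w}_i\,\frac{\partial\,\mathbf{w}_i'\boldsymbol{\gamma}}{\partial\boldsymbol{\alpha}'}\right] \\
&= -E\left[d_i\mathbf{w}_i\,\sigma\,\frac{\partial\lambda(\mathbf{x}_i'\boldsymbol{\alpha})}{\partial\boldsymbol{\alpha}'}\right].
\end{aligned}$$


The expression for $\mathbf{G}_{22}$ uses


$$\frac{\partial\mathbf{h}_{2i}}{\partial\boldsymbol{\gamma}'} = \frac{\partial\{d_i\mathbf{w}_i(\boldsymbol{\alpha})(y_i - \mathbf{w}_i(\boldsymbol{\alpha})'\boldsymbol{\gamma})\}}{\partial\boldsymbol{\gamma}'} = -d_i\mathbf{w}_i\mathbf{w}_i'.$$

Turning to $\mathbf{S}_0$ we have

$$\begin{aligned}
\mathbf{S}_{11} &= -\mathbf{G}_{11}, \\
\mathbf{S}_{21} &= \mathbf{0}, \\
\mathbf{S}_{22} &= \lim \frac{1}{N}\sum_{i=1}^N E[d_i\,(y_i - \mathbf{w}_i(\boldsymbol{\alpha})'\boldsymbol{\gamma})^2\,\mathbf{w}_i\mathbf{w}_i'].
\end{aligned}$$

The expression for $\mathbf{S}_{11}$ follows by applying the information matrix equality. Taking
expectations and some manipulation leads to $\mathbf{S}_{21} = \mathbf{0}$, and the expectation in $\mathbf{S}_{22}$ involves simply $V[\eta_i]$.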

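Combining these results gives the Heckman two-step estimator $\hat{\boldsymbol{\gamma}} \stackrel{a}{\sim} N[\boldsymbol{\gamma}, \mathbf{V}_{\gamma}]$,
where

$$\mathbf{V}_{\gamma} = (\mathbf{W}'\mathbf{W})^{-1}(\mathbf{W}'\boldsymbol{\Sigma}_{\eta}\mathbf{W} + \mathbf{W}'\mathbf{D}\mathbf{V}_{\alpha}\mathbf{D}'\mathbf{W})(\mathbf{W}'\mathbf{W})^{-1}, \tag{16.61}$$

and where $\mathbf{W}'\mathbf{W} = \sum_{i=1}^N d_i\mathbf{w}_i\mathbf{w}_i'$, $\mathbf{D}$ is the matrix with $i$th row $\hat{\sigma}\,\partial\lambda(\mathbf{x}_i'\boldsymbol{\alpha})/\partial\boldsymbol{\alpha}'\,|_{\hat{\boldsymbol{\alpha}}}$, $\mathbf{V}_{\alpha}$ is the variance
matrix for the first-stage probit MLE, and $\boldsymbol{\Sigma}_{\eta}$ is a diagonal matrix with $i$th entry $\sigma_{\eta_i}^2$. This
estimate is straightforward to obtain if matrix commands are available. The hardest
part can be obtaining $\sigma_{\eta_i}^2 = V[\eta_i]$ given in (16.24). If this is difficult we can
instead use $\hat{\sigma}_{\eta_i}^2 = (y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}} - \hat{\sigma}\lambda(\mathbf{x}_i'\hat{\boldsymbol{\alpha}}))^2$ following the approach of White (1980).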


16.11. Practical Considerations 


Most major packages include ML estimation of the Tobit model under normality. The
two-part model is easy to estimate as one can separately estimate each part. In principle
the bivariate sample selection model can be estimated by Heckman's two-step procedure
using only a probit and OLS routine. However, the standard errors are difficult
to compute owing to the two-step nature of the estimator, and it is much easier to
obtain standard errors using a package with Heckman's two-step procedure built-in.
Implementing semiparametric estimators generally requires specialized code in a
programming language such as GAUSS. Some packages also permit ML estimation of
censored and truncated variants of other models, such as the Poisson and negative
binomial for count data.

Censoring and truncation are easily handled if one views as reasonable the specified 
distribution. For example, top-coded income data are easily handled if the log-normal 
distribution fits the data well. Censored LAD, which relies on much weaker distribu- 
tional assumptions, can also be used in this situation. 

Much more problematic is handling models with sample selection. The more para- 
metric versions of these models can rely on distributional assumptions that are felt to 
be strong. Semiparametric versions still have to struggle with the identification require- 
ment that a variable that determines participation does not also determine the outcome 
of interest. A more promising route, one often taken in the treatment effects literature, 
is to limit attention to cases where it may be reasonable to assume that selection is only 
on observables. 


16.12. Bibliographic Notes 


The literature on models from selected samples is vast. Book-length treatments are provided 
by Maddala (1983) and Gouriéroux (2000), and shorter summaries are provided by Amemiya 
(1984, 1985) and Greene (2003). 


16.3 Tobin (1958) proposed and applied the Tobit model to expenditure data. Amemiya
(1973) formally established its consistency and asymptotic normality. Heckman 
(1974) provides an excellent female labor supply application with detailed analysis 
of results. 


16.4 The many studies of the Rand Health Insurance Experiment, such as that by Duan
et al. (1983), are leading applications of the two-part model. 


16.5 Heckman (1976, 1979) presented the two-step estimator of the bivariate sample se- 
lection model that is also the basis for many more recent semiparametric estimation 
procedures. Mroz (1987) provides an excellent application to female labor supply 
that places emphasis on the role of assumptions on wage exogeneity. 


16.7 There are many variants on the ideas of Roy (1951), just as there are many variants 
of the Tobit model. L-F. Lee (1978) provides a good early application to the
union-nonunion wage differential.


16.8 The work by Dubin and McFadden (1984) is a leading example of structural
microeconometric analysis based on complete specification of utility function and
distribution of unobservables.


16.9 Semiparametric estimation of binary choice models is presented in detail in the books
by M-J. Lee (1996), Horowitz (1997), and Pagan and Ullah (1999) and in surveys by
Vella (1998) and L-F. Lee (2001). Chay and Honoré (1998) and Chay and Powell
(2001) provide applications for censored models, and Melenberg and Van Soest
(1996) additionally estimate bivariate sample selection models.


16-1 


16-2 


16-3 


Exercises 


This question considers the impact of different degrees of truncation in the Tobit 
model. 


(a) Generate 200 draws of a latent variable y* = k + 3x + u, where u ~ NO, 3] 
and the regressor x ~ uniform[0, 1]. Choose k such that you generate ap- 
proximately 30% of y* to be negative. 

(b) Generate a censored or truncated subsample by excluding observations 
that correspond to y* < 0. 

(c) Estimate the model using all 2,000 observations, as if the latent variable 
were observable, by OLS. Evaluate your results in the light of the theoretical 
properties of OLS, keeping in mind that you have only one replication. 

(d) Using the truncated subsample of y > 0 only, estimate the model by OLS. 

(e) Use the truncated maximum likelihood option to estimate the parameters 

using all observations. Evaluate your results in light of the properties of the 

truncated MLE. Compare with the least-squares results from the previous 
two parts. 

Repeat all previous steps using a value of k so as to generate 20, 40, and 

50% censored observations. Compare your results with those based on 

30% censored observations. Hence suggest what is the consequence on 

the parameter estimates of higher levels of censoring. Reinforce your argu- 

ments using theory where possible. 


(f 


— 


Consider a latent variable modeled by y* = x,3 + ¢ with e; ~ [0,07]. Sup- 
pose y* is censored from above so that we observe y; = y; if yř < U; and 
yi = U; if y;* > U;, where the upper limit U; is a known constant for each in- 
dividual (i.e., data) and may differ over individuals. 


(a) Give the log-likelihood function for this model. [Hint: Note that this differs 
from the standard case both owing to presence of $U_i$ and because the 
inequalities are reversed with $y_i = y_i^*$ if $y_i^* < U_i$.] 

(b) Obtain the expression for the truncated mean $E[y_i | x_i, y_i < U_i]$. [Hint: For 
$z \sim N[0, 1]$, we have $E[z | z > c] = \phi(c)/[1 - \Phi(c)]$. Also, $E[z | z < c] = -E[-z | -z > -c]$ 
and $-z \sim N[0, 1]$.] 

(c) Hence give Heckman’s two-step estimator for this model. 

(d) Obtain the expression for the censored mean $E[y_i | x_i]$. [Hint: An essential 
part is the answer in part (b).] 


16-3 This question considers the consequences of misspecification in the Tobit 
model. The starting point is the model of Exercise 16.1. 


(a) Generate $y^*$ with heteroskedasticity by letting $u \sim N[0, \sigma^2 z]$, where $z > 0$ 
is chosen to be a suitable positive-valued variable that is correlated with 
x, though not perfectly so. Again set k to obtain about 30% of censored 
observations. Use the MLE for censored normal to estimate this model and 
compare your results with the corresponding homoskedastic case. 

(b) Now consider the impact of nonnormality in the sample. Use the simulation 
macro available in some packages to carry out a Monte Carlo evaluation 
based on a sample of 1,000 observations and 500 replications. In each 
replication generate a sample with censored observations such that the errors 
are drawn from a mixture of two normals: N[1, 9] or N[0.4, 1] with 
probabilities 0.4 and 0.6, respectively. Estimate the model using the censored 
Tobit MLE and compare your results with the normal case. Carry out an 
analysis of the Monte Carlo output for the two estimators. Draw appropriate 
conclusions about the impact of nonnormality on the distribution of the Tobit 
estimator. 


16-4 Consider a Poisson regression model where $y_i^*$ has density 
$f^*(y_i^*) = e^{-\mu_i}\mu_i^{y_i^*}/y_i^*!$, $y_i^* = 0, 1, 2, \ldots$, and we have independence over $i$. 
Because of coding error we only fully observe $y_i^*$ when $y_i^* \ge 2$. When $y_i^* = 0$ or 1 
we only observe that $y_i^* \le 1$. Suppose this is coded as $y_i = 1$. Define the observed 
data $y_i = y_i^*$ for $y_i^* \ge 2$ and $y_i = 1$ for $y_i^* = 0$ or 1. 


(a) Obtain the density f(y) of the observed y. 

(b) Obtain E[y]. [There is some algebra here.] 

Now introduce regressors with $E[y^* | x] = \exp(x'\beta)$ and define the indicator 
variable $d = 1$ for $y^* \ge 2$ and $d = 0$ for $y^* = 0$ or 1. 

(c) Give the exact formula for this example of the objective function of an 
estimator that provides a consistent estimator of $\beta$ using data on $y_i$, $d_i$, and 
$x_i$. 

(d) Give the exact formula for this example of the objective function of an 
estimator that provides a consistent estimator of $\beta$ using data on only $q_i$ and 
$x_i$. 

(e) Is it possible to consistently estimate $\beta$ using data on only $q_i$ and $x_i$? Explain 
your answer. 


16-5 Using a 50% random subsample of the RAND data on medical expenditure over 
a 12-month period used in this chapter, and using a similar model specification, 
we wish to consider the following broad question: Which model is appropriate 
for modeling the expenditure data? 


(a) Using the data summary of the expenditure variable, analyze the implica- 
tions of the high proportion of zero expenditures observed. Is this a violation 
of the normality assumption? Is there a transformation of expenditure that 
would make the assumption of normality more appropriate? 

(b) Three candidate models are considered, each with the same set of 
covariates. These covariates are the same as in the count data Exercise 20.6. The 
models are (i) the Tobit model, (ii) the two-part ("hurdle") model (TPM), and 
(iii) the selection model. Explain how each one of these will be set up, the 
relationship and connections among them, and how one might compare and 
choose among them. If you are likely to encounter any specific specification 
or estimation problems, state them and suggest how you might handle 
them. Pay attention to the choice of exclusion restrictions. 

(c) Estimate in turn the Tobit model, the TPM, and the selection models. For the 
TPM you have two equations, and the second is for those who have positive 
expenditures only. In the case of the selection model, use both the MLE 
and the two-step (Heckman) estimators. Discuss your reasons underlying 
the exclusion restriction required in the estimation of the selection model. Is 
there evidence that the selection problem is a serious issue? 

(d) How can we compare the statistical fit of the three models? Which model 
appears to provide the best fit to the data? By what criterion? 

(e) Suppose our main interest is in the impact of two variables on expenditure, 
log income and log of (1 + coinsurance rate). Use the results of your 
estimated Tobit model and TPM to make a comparison between the marginal 
impact of a change in these variables on expenditure. Given that there is 
considerable heterogeneity in the sample, suggest how to present the 
results of your analysis in the most informative manner. 

(f) Briefly explain how quantile regression (see Section 4.6) provides an 
alternative method of analyzing the same data. What are the main advantages 
and disadvantages of this approach in the present data situation? 

Transition Data: Survival Analysis 


17.1. Introduction 


Econometric models of durations are models of the length of time spent in a given 
state before transition to another state, such as duration unemployed or alive or without 
health insurance. In biostatistics a duration in a state is also known as lifetime and the 
time of transition is referred to as death; in operations research where one often studies 
lifetimes of physical objects such as light bulbs and machines, the end of useful life, 
that is, transition to useless life, is called failure time. In econometrics a state is a 
classification of an individual entity at a point in time, transition is movement from 
one state to another, and a spell length or duration is the time spent in a given state. A 
typical regression example is determining the effect of higher unemployment benefit 
levels on the average length of an unemployment spell or the probability of transition 
out of unemployment. 

The literature on this subject can be quite daunting, for a number of reasons. First, 
several related distributional functions are of interest and either the duration or prob- 
ability of transition may be modeled. Second, many different sampling schemes are 
possible and statistical inference depends on both the duration model and the sampling 
scheme. For example, sampling methods for data on unemployment duration include 
flow sampling of those entering unemployment in a given month, stock sampling of 
people unemployed in a given month, and population sampling of all people regardless 
of employment status. Third, the data on spell duration are often censored. This is a 
major reason for modeling transitions rather than the mean duration, the usual object 
of regression analysis, as weaker distributional assumptions are needed to consistently 
estimate models of the transitions. Fourth, transition data can be very rich with sev- 
eral states, such as unemployment, part-time employment, full-time employment, and 
out of the labor force, and data for a given individual may be available on multiple 
transitions among these states. Fifth, the literature appears in several different applied 
areas of statistics with different emphases. Duration analysis or transition analysis 
is also called survival analysis (length of time survived) in biostatistics, failure time 
analysis (length of time to failure of an item such as a light bulb or a machine part) 
in operations research, life table analysis in demography and actuarial studies (where 
leaving a state corresponds to death), and hazard analysis in insurance and accident 
theory. In the social sciences applications include recidivism, length of marriages, and 
interelection duration. 

In this chapter we present results for single-spell duration data obtained by flow 
sampling. The classic example is modeling survival time, with transition being from 
alive to dead, and many of the results come from survival analysis and life table analy- 
sis. This is the most studied example of transition analysis in statistics, and the survival 
analysis methods presented in this chapter are implemented in many statistical and mi- 
croeconometric packages. The chapter begins with a regression example to outline the 
issues raised with survival data. 

Sections 17.3-17.5 present results without regressors, as many new concepts arise 
even in this case. Section 17.3 introduces basic duration data concepts such as the 
hazard, cumulative hazard, and survivor functions. Section 17.4 defines various types 
of censoring, a common complication in duration analysis because the completed spell 
is not always observed. For example, a clinical trial will usually end before the last 
subject dies. Section 17.5 presents nonparametric estimators of the hazard, cumulative 
hazard (Nelson-Aalen estimator), and survivor functions (Kaplan-Meier estimator) 
that are consistent under independent censoring. 

The remainder of the chapter extends analysis to regression models, again un- 
der independent censoring. Estimation of fully parametric models, notably the 
Weibull model, is presented in Section 17.6. The treatment of censoring is simi- 
lar to that given for fully parametric Tobit models. Some important duration mod- 
els are given in Section 17.7. An alternative semiparametric approach is to in- 
stead model the hazard function, the probability of death conditional on survival 
to date. In his seminal paper, Cox (1972) proposed a method to consistently esti- 
mate a proportional hazards function with independent censoring under relatively 
weak distributional assumptions. The Cox model, the standard model for survival 
data, is presented in Section 17.8. Unlike most cross-section models, in survival 
models regressors such as unemployment benefits in an unemployment duration 
model may vary for a given person over the period that the subject is observed. 
Models with time-varying regressors are detailed in Section 17.9. Discrete haz- 
ards models are presented in Section 17.10. Section 17.11 presents an empirical 
example. 

Two subsequent chapters consider more complicated aspects of transition modeling 
that are rarely given a textbook treatment. These include unobserved heterogeneity, 
multiple spells, and multiple destinations. 


17.2. Example: Duration of Strikes 


Consider a data set on the duration of strikes that has been used by Kennan (1985), 
Jaggia (1991c), and others. The variable of interest is the duration of strikes in U.S. 
manufacturing, measured in number of days from the start of the strike. The sample 
has 566 complete (uncensored) observations on strike duration. The average duration 
of strike (dur) is 43.6 days, and the median is about 28 days. However, 90 days after 
the start of the strike 88 strikes are still in progress. 

Figure 17.1: Strike duration: Kaplan-Meier estimate of survival function. Data on completed 
spells for 566 strikes in the U.S. during 1968-76. 

We can show the strike duration information graphically as an empirical survival 
function. Figure 17.1 shows on the vertical axis the proportion of strikes started that 
are still in progress after a stated number of days. Calendar time is ignored in this 
figure, meaning that the different start date of different strikes plays no role in the 
construction of the figure. As expected, the function starts at one and monotonically 
declines to zero, indicating that all strikes must eventually end. 

Now introduce a regressor variable (z) that measures the deviation of output from its 
trend level, an indicator of the business cycle position of the economy. Positive values 
of z indicate above-trend growth period and negative values indicate the converse. 
Suppose that our main interest lies in testing whether average strike duration is 
procyclical (i.e., d(dur)/dz > 0) or anticyclical (i.e., d(dur)/dz < 0). A simple way to 
proceed might be to model the conditional expectation of ln(dur) by a linear regression 
of ln(dur) on z. This may serve the purpose if one is testing for the presence of a 
positive or negative association between dur and z. 

Possibly we might instead be interested in modeling the conditional probability of 
a strike. Such a goal could be achieved by a binomial regression with a 0/1 outcome 
variable. However, suppose that our interest is in modeling the probability that a strike 
that has been in progress for t days will end on day t + 1, or in modeling the conditional 
probability of the strike in progress ending, as a function of the length of the 
strike, controlling for z; then the previously mentioned regression approaches will be 
less direct and less efficient than survival analysis, which also has the additional ad- 
vantage that it can handle censored durations. In the next section we will consider 
statistical concepts that are used in survival analysis. 

17.3. Basic Concepts 


Duration in a state is a nonnegative random variable, denoted T, which in economic 
data is often a discrete random variable. For explaining the basic concepts we focus on 
the continuous case, followed by the discrete case later in the chapter. 


17.3.1. Survivor, Hazard, and Cumulative Hazard Functions 


The cumulative distribution function of T is denoted F(t) and the density function 
is f(t) = dF(t)/dt. Then the probability that the duration or spell length is less than 
or equal to t is 

$$F(t) = \Pr[T \le t] = \int_0^t f(s)\,ds. \qquad (17.1)$$

A complementary concept to the cdf is the probability that duration exceeds t, 
called the survivor function, which is defined by 

$$S(t) = \Pr[T > t] = 1 - F(t). \qquad (17.2)$$


The definition of the cdf in (17.1) equals the usual definition, following Kalbfleisch 
and Prentice (2002). In the duration analysis literature other authors, such as 
Lancaster (1990), instead define $F(t) = \Pr[T < t]$ and hence $S(t) = \Pr[T \ge t]$ because 
hazard functions, defined below, condition on $T \ge t$ rather than $T > t$. The 
particular definition used will make a difference in the discrete case, considered in Section 
17.3.2, at the exact time that a transition occurs. 

The survivor function is monotonically declining from one to zero since the cdf 
is monotonically increasing from zero. If all individuals at risk of leaving the state 
eventually do so then $S(\infty) = 0$. Otherwise, $S(\infty) > 0$ and the duration 
distribution is called defective. The mean of a completed spell length is the integral 
$\int_0^\infty S(u)\,du$. To obtain this result, use 

$$\int_0^\infty u f(u)\,du = \int_0^\infty u\,dF(u) = uF(u)\Big|_0^\infty - \int_0^\infty F(u)\,du.$$

Since $F(\infty) = 1$ and $F(0) = 0$, it follows that 

$$E[T] = \int_0^\infty (1 - F(u))\,du = \int_0^\infty S(u)\,du. \qquad (17.3)$$


The mean duration equals the area under the survival curve. 
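
As a numerical check of (17.3), the following sketch compares the area under a survivor curve with the known mean. It assumes a Weibull survivor function $S(t) = \exp(-\gamma t^{\alpha})$ with illustrative parameter values; the function names, the upper integration limit, and the grid size are choices made here for illustration only.

```python
import math

def weibull_survivor(t, gamma, alpha):
    """Survivor function S(t) = exp(-gamma * t**alpha)."""
    return math.exp(-gamma * t ** alpha)

def area_under_survivor(gamma, alpha, upper=50.0, steps=50_000):
    """Trapezoid-rule approximation to the integral of S(u) over [0, upper].

    Assumes S(upper) is negligible, so the truncated integral
    approximates the integral over [0, infinity).
    """
    h = upper / steps
    total = 0.5 * (weibull_survivor(0.0, gamma, alpha)
                   + weibull_survivor(upper, gamma, alpha))
    for i in range(1, steps):
        total += weibull_survivor(i * h, gamma, alpha)
    return total * h

def weibull_mean(gamma, alpha):
    """Closed-form mean: gamma**(-1/alpha) * Gamma(1/alpha + 1)."""
    return gamma ** (-1.0 / alpha) * math.gamma(1.0 / alpha + 1.0)

if __name__ == "__main__":
    # alpha = 1 is the exponential special case, with mean 1/gamma.
    for gamma, alpha in [(0.5, 1.0), (0.5, 1.5)]:
        print(gamma, alpha,
              area_under_survivor(gamma, alpha),
              weibull_mean(gamma, alpha))
```

The two printed columns should agree up to the numerical integration error.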
Another key concept is the hazard function, which is the instantaneous probability 
of leaving a state conditional on survival to time t. This is defined as 

$$\lambda(t) = \lim_{\Delta t \to 0} \frac{\Pr[t \le T < t + \Delta t \mid T \ge t]}{\Delta t} = \frac{f(t)}{S(t)}. \qquad (17.4)$$

Table 17.1. Survival Analysis: Definitions of Key Concepts 

Function            Symbol         Definition                                                          Relationships 
Density             $f(t)$                                                                             $f(t) = dF(t)/dt$ 
Distribution        $F(t)$         $\Pr[T \le t]$                                                      $F(t) = \int_0^t f(s)\,ds$ 
Survivor            $S(t)$         $\Pr[T > t]$                                                        $S(t) = 1 - F(t)$ 
Hazard              $\lambda(t)$   $\lim_{h \to 0} \Pr[t \le T < t + h \mid T \ge t]/h$                $\lambda(t) = f(t)/S(t)$ 
Cumulative hazard   $\Lambda(t)$   $\int_0^t \lambda(s)\,ds$                                           $\Lambda(t) = -\ln S(t)$ 


It is easily verified that the hazard equals the negative of the change in the log-survivor function, 

$$\lambda(t) = -\frac{d \ln S(t)}{dt}.$$

The hazard $\lambda(t)$ specifies the distribution of T. In particular, integrating $\lambda(t)$ and using 
$S(0) = 1$ we can show that 

$$S(t) = \exp\left(-\int_0^t \lambda(u)\,du\right). \qquad (17.5)$$


In regression analysis of transitions the conditional hazard rate, A(t|x), is of central 
interest. This contrasts with more standard regression approaches in which the condi- 
tional mean function, E[T |x], is of chief interest. The latter approach has the disad- 
vantage that in practice the durations are often censored. 

A final related function is the cumulative hazard function or integrated hazard 
function 


$$\Lambda(t) = \int_0^t \lambda(s)\,ds = -\ln S(t), \qquad (17.6)$$

where the last equality uses (17.5). If $S(\infty) = 0$ then $\Lambda(\infty) = \infty$. The 
cumulative hazard is of interest as it can be more precisely estimated than the hazard 
function. 

For any choice of distribution of T, it can be shown that the transformation $\Lambda(T)$ is 
unit exponentially distributed and $\ln \Lambda(T)$ is extreme value distributed, providing the 
basis for model specification tests, see Section 18.7.2. 
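
A small simulation can illustrate the first of these claims. The sketch below assumes Weibull durations with cumulative hazard $\Lambda(t) = \gamma t^{\alpha}$ and arbitrary illustrative parameter values; the helper names are hypothetical. If $\Lambda(T)$ is unit exponential, its sample mean and variance should both be close to one.

```python
import math
import random

def draw_weibull(rng, gamma, alpha):
    """Inverse-transform draw from S(t) = exp(-gamma * t**alpha)."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return (-math.log(u) / gamma) ** (1.0 / alpha)

def integrated_hazard(t, gamma, alpha):
    """Weibull cumulative hazard Lambda(t) = gamma * t**alpha."""
    return gamma * t ** alpha

def simulate_cumhaz_moments(n=100_000, gamma=0.5, alpha=1.5, seed=0):
    """Sample mean and variance of Lambda(T) over n simulated durations."""
    rng = random.Random(seed)
    values = [integrated_hazard(draw_weibull(rng, gamma, alpha), gamma, alpha)
              for _ in range(n)]
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, var

if __name__ == "__main__":
    print(simulate_cumhaz_moments())
```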

Various related functions for the nonnegative continuous random variable T are 
summarized in Table 17.1. 

Other functions are also used at times, most notably the Laplace transform L(s) = 
E[exp(—sT)], s > 0, which is a variant of the moment-generating function for random 
variable T restricted to be positive. 


17.3.2. Discrete Data 


It is very common for a duration to be measured as an interval. For example, data may 
indicate that a transition occurred in a particular week, but the exact time in the week 
is not given. In such cases the transition times are said to be grouped and it is assumed 
that the hazard within the interval is constant. Discrete-time hazard models deal with 
such data. 

The starting point is to define the discrete-time hazard function as the probability 
of transition at discrete time $t_j$, $j = 1, 2, \ldots$, given survival to time $t_j$: 

$$\lambda_j = \Pr[T = t_j \mid T \ge t_j] = \frac{f^d(t_j)}{S^d(t_{j-})}, \qquad (17.7)$$

where the superscript d denotes discrete, and where $S^d(a_-) = \lim_{t \to a_-} S^d(t)$, an 
adjustment made because formally $S^d(t)$ equals $\Pr[T > t]$ rather than $\Pr[T \ge t]$. 

The discrete-time survivor function is obtained recursively from the hazard func- 
tion as 


$$S^d(t) = \Pr[T > t] = \prod_{j \mid t_j \le t} (1 - \lambda_j). \qquad (17.8)$$

For example, $\Pr[T > t_2]$ equals the probability of no transition at time $t_1$ times the 
probability of no transition at time $t_2$ conditional on surviving to just before $t_2$, so that 
$\Pr[T > t_2] = (1 - \lambda_1) \times (1 - \lambda_2)$. The function $S^d(t)$ is a decreasing step function 
with steps at $t_j$, $j = 1, 2, \ldots$. 

The discrete-time cumulative hazard function is 


$$\Lambda^d(t) = \sum_{j \mid t_j \le t} \lambda_j. \qquad (17.9)$$

Using (17.7), we have that the discrete probability that the spell ends at $t_j$ is 
$\lambda_j S^d(t_{j-})$. 
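
The discrete-time formulas (17.7) to (17.9) can be sketched in a few lines. The hazard values used in the usage example are made up for illustration, and the function names are hypothetical.

```python
def discrete_survivor(times, hazards, t):
    """S^d(t): product of (1 - lambda_j) over failure times t_j <= t, as in (17.8)."""
    s = 1.0
    for tj, lam in zip(times, hazards):
        if tj <= t:
            s *= 1.0 - lam
    return s

def discrete_cumulative_hazard(times, hazards, t):
    """Lambda^d(t): sum of lambda_j over failure times t_j <= t, as in (17.9)."""
    return sum(lam for tj, lam in zip(times, hazards) if tj <= t)

def exit_probabilities(hazards):
    """Pr[T = t_j] = lambda_j * S^d(t_j-), where S^d(t_1-) = 1."""
    probs, surv_before = [], 1.0
    for lam in hazards:
        probs.append(lam * surv_before)
        surv_before *= 1.0 - lam
    return probs

if __name__ == "__main__":
    times = [1, 2, 3]            # illustrative failure times
    hazards = [0.1, 0.2, 0.5]    # illustrative discrete hazards
    print(discrete_survivor(times, hazards, 2))
    print(discrete_cumulative_hazard(times, hazards, 2))
    print(exit_probabilities(hazards))
```

The exit probabilities plus the survivor function at the last failure time sum to one, which is a quick consistency check on any set of discrete hazards.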

The continuous and discrete cases can be combined. The survivor function is then 
defined using the product integral, which reduces to the regular product (17.8) 
in the discrete case and to the exponential of the regular integral (17.5) in the 
continuous case. See Kalbfleisch and Prentice (2002, p. 10) or Lancaster (1990, 
pp. 10-12). 

Discrete duration data may arise because the process generating transitions is in- 
trinsically discrete. More often, however, the underlying process is continuous but the 
data are observed discretely. For example, one may know the week or month in which a 
spell ends, but not the day or hour. Such data are sometimes known as grouped data. 
The discrete data formulas can be used as follows. Let time be divided into k + 1 
intervals $[a_0, a_1), [a_1, a_2), \ldots, [a_{k-1}, a_k), [a_k, \infty)$. The discrete time duration $T = t_j$ 
indicates a transition in the interval $[a_{j-1}, a_j)$, that is, transition at time $a_{j-1}$ or later. 
It is customary to treat discrete data as resulting from grouping, so that transitions are 
modeled in continuous time and then necessary adjustments are made for grouping. 
Further discussion is given in Section 17.10. 

17.4. Censoring 


Survival data are usually censored, as some spells are incompletely observed. That 
is, the lifetimes are only known to lie in certain intervals. As an example, instead 
of observing the length of completed spell of unemployment, data may come from a 
survey of the currently unemployed, so that only the length of an incomplete spell of 
unemployment is observed. 


17.4.1. Censoring Mechanisms 


In practice data may be right-censored, left-censored, or interval-censored. For right- 
censoring or censoring from above, we observe spells from time 0 until a censoring 
time c. Some spells will have ended by this time anyway (completed spells), but others 
will be incomplete and all we know is that they will end some time in the interval 
(c, oo). Left-censoring or censoring from below occurs when spells are known to end 
at some time in the interval (0, c) but the exact time is unknown. The classical Tobit 
model is an example, where data on some spells are lost and the censoring time is 
unknown. Interval-censoring occurs when the completed spell length is observed but 
only in interval form such as in $[t_1^*, t_2^*)$. 

The survival analysis literature has focused on right-censoring. Even with this re- 
striction there are a variety of possible reasons for censoring, including random cen- 
soring, type I censoring, and type II censoring. 

Random censoring or exogenous censoring means that each individual in the 
sample has a completed duration $T_i^*$ and censoring time $C_i^*$ that are independent of 
each other. We observe the completed duration $T_i^*$ if the spell ends before the 
censoring time and the censoring time $C_i^*$ if the spell ends after the censoring time. 
In addition it is known whether or not censoring has occurred. The observed data 
$(t_1, \delta_1), (t_2, \delta_2), \ldots, (t_N, \delta_N)$ are realizations of the random variables 

$$T_i = \min(T_i^*, C_i^*), \qquad (17.10)$$
$$\delta_i = 1[T_i^* < C_i^*],$$

where the indicator function 1[A] equals one if event A occurs and equals zero 
otherwise. Note that $\delta_i$ equals one if a completed spell is observed and equals zero 
otherwise. Random censoring may result from causes such as random failure to 
follow up a case, individuals randomly dropping out of the study, or termination of the 
study. 
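
The observation rule in (17.10) is easy to simulate. The sketch below assumes, purely for illustration, that both the latent duration and the censoring time are exponentially distributed and independent; the rates and the function name are hypothetical choices.

```python
import random

def simulate_random_censoring(n, rate_duration, rate_censor, seed=0):
    """Return a list of (t_i, delta_i) pairs generated as in (17.10).

    Assumes independent exponential latent durations T* and censoring
    times C*; these distributions are illustrative, not part of (17.10).
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        t_star = rng.expovariate(rate_duration)   # latent completed duration
        c_star = rng.expovariate(rate_censor)     # latent censoring time
        t_obs = min(t_star, c_star)               # observed duration
        delta = 1 if t_star < c_star else 0       # 1 = completed spell observed
        data.append((t_obs, delta))
    return data

if __name__ == "__main__":
    sample = simulate_random_censoring(20_000, 1.0, 0.5)
    share_completed = sum(delta for _, delta in sample) / len(sample)
    # With these exponential rates the share of completed spells should be
    # near rate_duration / (rate_duration + rate_censor).
    print(share_completed)
```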

Type I censoring occurs when durations are censored above a certain fixed known 
censoring time, say $t_c$. For example, a sample of light bulbs may be tested for no more 
than 5,000 hours, with a common starting time for all items. Thus at the termination of 
the study the failure times or durations of some items will be known but other objects 
will still not have "failed." Their lifetimes are said to be right-censored. This is a 
special case of random censoring, with $C_i^* = t_c$. The classic Tobit model is an example 
of type I censoring from below for a random variable continuous on $(-\infty, \infty)$. 

17.4.2. Independent (Noninformative) Censoring 


For standard survival analysis methods to be valid in the presence of censoring the 
censoring mechanism needs to be one with independent (noninformative) censor- 
ing. This means that parameters of the distribution of C* are not informative about the 
parameters of the distribution of the duration T*. Then one may treat the censoring 
indicator $\delta$ as exogenous, and it is then not necessary to model the censoring mechanism 
if interest lies in the duration model parameters. 

For censored data $(t, \delta)$ the uncensored observations are observed with probability 

$$\Pr[T = t, \delta = 1] = \Pr[T = t \mid \delta = 1] \times \Pr[\delta = 1].$$

If the censoring mechanism is independent then $\Pr[T = t \mid \delta = 1] = \Pr[T = t]$. If the 
censoring is noninformative then the term $\Pr[\delta = 1]$ can be dropped from the 
likelihood function as it does not involve parameters of the distribution for T. Similarly, for 
censored observations, 

$$\Pr[T = t, \delta = 0] = \Pr[T > t \mid \delta = 0] \times \Pr[\delta = 0],$$

with $\Pr[T > t \mid \delta = 0] = \Pr[T > t]$ under independent censoring and $\Pr[\delta = 0]$ being 
ignored under noninformative censoring. Combining, the density of interest reduces to 
$\Pr[T = t]$ when $\delta = 1$ and $\Pr[T > t]$ when $\delta = 0$. 
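
To make the resulting likelihood concrete, the following sketch writes it out for the exponential distribution, where $f(t) = \gamma \exp(-\gamma t)$ and $S(t) = \exp(-\gamma t)$. The choice of the exponential, the toy data, and the function names are assumptions made for illustration. Completed spells contribute $\ln f(t)$ and censored spells contribute $\ln S(t)$, so that the MLE is the number of completed spells divided by total observed time.

```python
import math

def censored_exponential_loglik(gamma, data):
    """Log-likelihood for right-censored exponential durations.

    data is a list of (t, delta) pairs with delta = 1 for completed spells.
    """
    ll = 0.0
    for t, delta in data:
        if delta == 1:
            ll += math.log(gamma) - gamma * t   # ln f(t) for a completed spell
        else:
            ll += -gamma * t                    # ln S(t) for a censored spell
    return ll

def censored_exponential_mle(data):
    """Closed-form maximizer: completed spells divided by total observed time."""
    completed = sum(delta for _, delta in data)
    exposure = sum(t for t, _ in data)
    return completed / exposure

if __name__ == "__main__":
    toy = [(2.0, 1), (3.0, 0), (1.0, 1), (4.0, 0)]   # made-up data
    g_hat = censored_exponential_mle(toy)
    print(g_hat, censored_exponential_loglik(g_hat, toy))
```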

When regressors x are introduced it is possible for T* and C* to vary with the same 
regressors. Again what matters is that C* parameters are not informative about the T* 
parameters. Even more simply, at any given point in time, censoring must not occur 
because a subject has unusually high or low risk of failure given x. 

Type II censoring occurs when observation on N subjects ceases after the pth 
failure. Then only the durations for the p shortest spells are completely observed, 
and the remaining N − p are censored at $C_i = t_{(p)}$, the duration of the pth shortest 
complete spell. For example, a clinical trial may end after p patients have died. 

Random, type I, and type II censoring are all examples of independent censoring. 
A more formal treatment is given in Kalbfleisch and Prentice (2002, pp. 194-196). 


17.5. Nonparametric Models 


This section deals with nonparametric estimation of survival functions. These methods 
are very useful for descriptive purposes. It is often insightful to know the shape of 
the raw (unconditional) hazard or survival function before considering introducing 
regressors. The strike duration example illustrates the point. 

We present estimators of the survivor, hazard, and cumulative hazard functions in 
the presence of independent censoring. Nonparametric estimation of the density itself 
is not considered because of the difficulty introduced by censoring; more importantly 
the survivor and hazard functions are more interpretable than the density. 

No regressors are included. If interest lies in just a few key values of regressor(s), 
such as different treatment regimes or levels of treatment, then one can obtain sep- 
arate nonparametric estimates at each key value and compare them. In economics 
applications this is rarely the case and more structural models with regressors, 
presented in Sections 17.6-17.10, are needed. 

We focus on discrete durations, such as life table data, so that the discrete-time 
formulation of Section 17.3.2 is used. Consider, for example, a cohort of $N_0$ individuals 
of specific age and gender, which is subsequently tracked for a number of years. At 
the end of year 1, there are $N_1$ individuals in the cohort, and $N_0 - N_1$ individuals 
from the original cohort have either died or been lost for other reasons (censored). 
A year later the size of the cohort is $N_2$, a further reduction of $N_1 - N_2$, and so forth. Such life table data can 
be used to construct a discrete-time survivor function without any prior parametric 
assumptions. 


17.5.1. Nonparametric Estimation 


With no censoring the obvious estimator of the survivor function is one minus the 
sample cumulative distribution function. Then $\hat{S}(t)$ equals the number of spells in the 
sample of duration greater than t, divided by the sample size N. This is a step function 
with jump at each discrete failure time; see Figure 17.1. An alternative equivalent 
representation of this estimator, given momentarily in (17.13), maintains consistency 
in the presence of independent censoring. 

Let $t_1 < t_2 < \cdots < t_j < \cdots < t_k$ denote the observed discrete failure times of the 
spells in a sample of size N, $N \ge k$. Define $d_j$ to be the number of spells that end 
at time $t_j$. Since the data are discrete $d_j$ may exceed one. Some spells may be 
incompletely observed. Define $m_j$ to be the number of spells right-censored in the 
interval $[t_j, t_{j+1})$. The censoring mechanism is assumed to be independent censoring, 
so the only thing known about a spell censored in $[t_j, t_{j+1})$ is that the failure time is 
greater than $t_j$. Spells are at risk of failure if they have not yet failed or been censored. 
Define $r_j$ to equal the number of spells at risk at time $t_{j-}$, that is, just before time 
$t_j$. Then $r_j = (d_j + m_j) + \cdots + (d_k + m_k) = \sum_{l \mid l \ge j} (d_l + m_l)$. Note that $r_1 = N$. In 
summary, 

$$d_j = \#\ \text{spells ending at time } t_j, \qquad (17.11)$$
$$m_j = \#\ \text{spells censored in } [t_j, t_{j+1}),$$
$$r_j = \#\ \text{spells at risk at time } t_{j-} = \sum_{l \mid l \ge j} (d_l + m_l).$$


The discrete-time formulation of Section 17.3.2 is used. Since $\lambda_j = 
\Pr[T = t_j \mid T \ge t_j]$, an obvious estimator of the hazard function is the number 
of spells ending at time $t_j$ divided by the number at risk of failure at time $t_{j-}$, or 

$$\hat{\lambda}_j = \frac{d_j}{r_j}. \qquad (17.12)$$

The discrete-time survivor function is defined in (17.8). The Kaplan-Meier 
estimator or product limit estimator of the survivor function is the sample analogue 

$$\hat{S}(t) = \prod_{j \mid t_j \le t} (1 - \hat{\lambda}_j) = \prod_{j \mid t_j \le t} \frac{r_j - d_j}{r_j}. \qquad (17.13)$$

Table 17.2. Hazard Rate and Survivor Function Computation: Example^a 

j    r_j   d_j   m_j   $\hat{\lambda}_j = d_j/r_j$   $\hat{\Lambda}(t_j)$             $\hat{S}(t_j)$ 
1    80    6     4     6/80                          6/80                             (1 − 6/80) 
2    70    5     3     5/70                          6/80 + 5/70                      (1 − 6/80) × (1 − 5/70) 
3    62    2     1     2/62                          $\hat{\Lambda}(t_2)$ + 2/62      $\hat{S}(t_2)$ × (1 − 2/62) 
4    59    ... 

^a At time $t_j$, $r_j$ is the number of observations at risk, $d_j$ is the number of deaths (failures), $m_j$ is the number of 
missing spells (censored), $\hat{\lambda}_j$ is the estimated hazard rate, $\hat{\Lambda}(t_j)$ is the estimated cumulative hazard, and $\hat{S}(t_j)$ 
is the estimated survivor function. 


This is a decreasing step function with jump at each discrete failure time. The Kaplan- 
Meier estimator can be shown to be the nonparametric MLE (see Kalbfleisch and 
Prentice, 2002, pp. 14-16). 

In the case of no censoring $\hat{S}(t)$ in (17.13) simplifies to the number still at risk 
just after time t divided by the sample size N, which is one minus the empirical cdf. 
To see this note that $r_j - d_j = r_{j+1}$ if $m_j = 0$, since then the number at risk at time 
$t_j$ less the number of deaths at time $t_j$ equals the number at risk at time $t_{j+1}$. Then 
(17.13) becomes $\hat{S}(t) = \prod_{j \mid t_j \le t} r_{j+1}/r_j$, which telescopes to $r_{J+1}/r_1$, where $t_J$ is the last failure time not exceeding t and $r_1 = N$. 

The discrete-time cumulative hazard function is defined in (17.9). The 
Nelson-Aalen estimator of the cumulative hazard function is the obvious sample analogue 

$$\hat{\Lambda}(t) = \sum_{j \mid t_j \le t} \hat{\lambda}_j = \sum_{j \mid t_j \le t} \frac{d_j}{r_j}. \qquad (17.14)$$

This estimator can also be used to estimate the survival function by $\hat{S}(t) = 
\exp(-\hat{\Lambda}(t))$, using the continuous case equality $S(t) = \exp(-\Lambda(t))$. 

As an illustration, suppose that there are initially 80 observations, with 6 failures at 
time $t_1$, 4 spells censored in $[t_1, t_2)$, 5 failures at time $t_2$, 3 spells censored in $[t_2, t_3)$, 
2 failures at time $t_3$, 1 spell censored in $[t_3, t_4)$, and so on. Then the estimates for the 
cumulative hazard and survivor function for $t \le t_3$ are given in Table 17.2. 
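
The computations in this illustration can be reproduced with a short sketch of (17.12) to (17.14). The counts are those of the illustration (80 initial spells; failures 6, 5, 2; censored 4, 3, 1); the function names are hypothetical.

```python
def risk_sets(n, deaths, censored):
    """Number at risk just before each failure time: r_1 = n, and
    r_{j+1} = r_j - d_j - m_j."""
    risk, at_risk = [], n
    for d, m in zip(deaths, censored):
        risk.append(at_risk)
        at_risk -= d + m
    return risk

def kaplan_meier_nelson_aalen(deaths, risk):
    """Return (hazard, cumulative hazard, survivor) at each failure time,
    following (17.12), (17.14), and (17.13) respectively."""
    rows, surv, cumhaz = [], 1.0, 0.0
    for d, r in zip(deaths, risk):
        hazard = d / r
        cumhaz += hazard           # Nelson-Aalen running sum
        surv *= 1.0 - hazard       # Kaplan-Meier running product
        rows.append((hazard, cumhaz, surv))
    return rows

if __name__ == "__main__":
    deaths = [6, 5, 2]
    censored = [4, 3, 1]
    risk = risk_sets(80, deaths, censored)
    for j, row in enumerate(kaplan_meier_nelson_aalen(deaths, risk), start=1):
        print(j, risk[j - 1], *row)
```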

Tied data arise when multiple failures occur at a particular point in time. It is com- 
mon to assume that ties occur because of grouping, rather than because the process 
generates true discrete ties. The hazard estimate $\hat{\lambda}_j = d_j/r_j$ assumes that all deaths 
occur simultaneously at time $t_j$. In fact deaths may occur progressively over the 
interval $[t_j, t_{j+1})$ and censoring may also occur progressively over this interval. Then 
$r_j$ overstates the number of subjects at risk on average over the interval $[t_j, t_{j+1})$. A 
standard correction in life table analysis is to replace $\hat{\lambda}_j = d_j/r_j$ by $d_j/(r_j - m_j/2)$, 
with similar changes in the formulas for $\hat{S}(t)$, $\hat{\Lambda}(t)$, and so on. Other corrections have 
also been proposed. 

Most survival analysis programs do a good job of producing basic Kaplan—Meier 
plots and tables. Table 17.3 provides an abstract of such output for the strike data and 
complements Figure 17.1 given earlier. 

Table 17.3. Strike Duration: Kaplan-Meier Survivor Function Estimates 

Day    Beginning Total    Failures    Survivor Function    Standard Error 
1      566                10          0.9823               0.0055 
2      556                21          0.9452               0.0096 
3      535                16          0.9170               0.0116 
4      519                17          0.8869               0.0133 
5      502                18          0.8551               0.0148 
6      484                9           0.8392               0.0154 
7      475                12          0.8180               0.0162 
8      463                12          0.7968               0.0169 
13     411                11          0.7067               0.0191 
14     400                11          0.6873               0.0195 


17.5.2. Confidence Bands for Nonparametric Estimates 


The estimate $\hat{\lambda}_j = d_j/r_j$ of the hazard function is very discontinuous, especially for t 
large as then $r_j$ becomes small relative to $d_j/r_j$. It can be visually useful to first smooth 
the hazard estimates, using nonparametric regression methods, see Section 9.5, before 
plotting them against time. 

The survivor and cumulative hazard functions are much smoother, and it is standard 
to plot these against time, along with confidence bands that do reflect sampling vari- 
ability. There are several ways to estimate these confidence bands. The formulas we 
give are those used in STATA. 

For the Kaplan–Meier estimate of the survivor function it is common to use the
Greenwood estimate of the variance

$$\hat{V}[\hat{S}(t)] = \hat{S}(t)^2 \sum_{j \mid t_j \le t} \frac{d_j}{r_j (r_j - d_j)}.$$


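Reported confidence intervals for $S(t)$ are often based on $\ln(-\ln S(t))$ rather than
on $S(t)$, as this transformation ensures the confidence interval lies in the range of
the survivor function, which is between zero and one. The transformation yields the
$100(1 - \alpha)\%$ confidence interval

$$S(t) \in \left( \hat{S}(t)^{\exp(z_{\alpha/2}\hat{\sigma}_s(t))},\ \hat{S}(t)^{\exp(-z_{\alpha/2}\hat{\sigma}_s(t))} \right), \tag{17.15}$$

where $\hat{\sigma}_s(t)$ denotes the standard deviation of $\ln(-\ln \hat{S}(t))$, which is estimated using

$$\hat{\sigma}_s^2(t) = \frac{\sum_{j \mid t_j \le t} d_j/(r_j(r_j - d_j))}{\left[ \sum_{j \mid t_j \le t} \ln((r_j - d_j)/r_j) \right]^2}.$$

Because $0 < \hat{S}(t) < 1$, the larger exponent gives the lower limit of the interval.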
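
Table 17.4. Exponential and Weibull Distributions: pdf, cdf, Survivor
Function, Hazard, Cumulative Hazard, Mean, and Variance

Function            Exponential                 Weibull
$f(t)$              $\gamma \exp(-\gamma t)$    $\gamma \alpha t^{\alpha-1} \exp(-\gamma t^{\alpha})$
$F(t)$              $1 - \exp(-\gamma t)$       $1 - \exp(-\gamma t^{\alpha})$
$S(t)$              $\exp(-\gamma t)$           $\exp(-\gamma t^{\alpha})$
$\lambda(t)$        $\gamma$                    $\gamma \alpha t^{\alpha-1}$
$\Lambda(t)$        $\gamma t$                  $\gamma t^{\alpha}$
$E[T]$              $\gamma^{-1}$               $\gamma^{-1/\alpha} \Gamma(\alpha^{-1} + 1)$
$V[T]$              $\gamma^{-2}$               $\gamma^{-2/\alpha} [\Gamma(2\alpha^{-1} + 1) - (\Gamma(\alpha^{-1} + 1))^2]$
$\gamma, \alpha$    $\gamma > 0$                $\gamma > 0, \alpha > 0$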


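For the Nelson–Aalen estimator of the cumulative hazard function one variance
estimate is

$$\hat{V}[\hat{\Lambda}(t)] = \sum_{j \mid t_j \le t} \frac{d_j}{r_j^2}.$$

The transformation $\ln \Lambda(t)$ yields the $100(1 - \alpha)\%$ confidence interval for the
cumulative hazard

$$\Lambda(t) \in \left[ \hat{\Lambda}(t) \exp(-z_{\alpha/2}\hat{\sigma}_{\Lambda}(t)),\ \hat{\Lambda}(t) \exp(z_{\alpha/2}\hat{\sigma}_{\Lambda}(t)) \right], \tag{17.16}$$

where $\hat{\sigma}_{\Lambda}(t)$ denotes the standard deviation of $\ln \hat{\Lambda}(t)$, which is estimated using

$$\hat{\sigma}_{\Lambda}^2(t) = \hat{V}[\hat{\Lambda}(t)] / [\hat{\Lambda}(t)]^2.$$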
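
As a rough check on these formulas, the sketch below applies them to the first eight days of the strike data in Table 17.3, where the counts imply no censoring (each beginning total equals the previous total less the previous failures). The function name `bands`, the dictionary keys, and the 95% critical value are our own choices; the code is a sketch of the formulas above, not of any package's implementation.

```python
import math

# (beginning total r_j, failures d_j) for days 1-8 of Table 17.3
TABLE = [(566, 10), (556, 21), (535, 16), (519, 17),
         (502, 18), (484, 9), (475, 12), (463, 12)]
Z = 1.959964  # standard normal critical value for a 95% band

def bands(table, z=Z):
    """Kaplan-Meier and Nelson-Aalen estimates with pointwise confidence limits."""
    rows = []
    surv, cum_haz = 1.0, 0.0
    greenwood, log_sum, na_var = 0.0, 0.0, 0.0
    for r, d in table:
        surv *= (r - d) / r
        cum_haz += d / r
        greenwood += d / (r * (r - d))       # Greenwood sum
        log_sum += math.log((r - d) / r)     # running value of ln S-hat(t)
        na_var += d / r ** 2                 # Nelson-Aalen variance estimate
        sigma_s = math.sqrt(greenwood) / abs(log_sum)
        sigma_l = math.sqrt(na_var) / cum_haz
        rows.append({
            "S": surv,
            "se_S": surv * math.sqrt(greenwood),
            "S_lo": surv ** math.exp(z * sigma_s),
            "S_hi": surv ** math.exp(-z * sigma_s),
            "L": cum_haz,
            "L_lo": cum_haz * math.exp(-z * sigma_l),
            "L_hi": cum_haz * math.exp(z * sigma_l),
        })
    return rows

rows = bands(TABLE)
for day, row in enumerate(rows, start=1):
    print(day, round(row["S"], 4), round(row["se_S"], 4),
          round(row["S_lo"], 4), round(row["S_hi"], 4))
```

The survivor function and Greenwood standard error columns can be compared with the corresponding columns of Table 17.3.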


17.6. Parametric Regression Models 


We begin by outlining the properties of two distributions that perform a benchmark 
role. Then some standard regression models for duration data are considered. 


17.6.1. Exponential and Weibull Distributions 


The natural parametric starting point is the exponential, because a pure Poisson point
process has durations that are exponentially distributed, see Lancaster (1990, p. 86).
The exponential duration distribution has a constant hazard rate $\gamma$ that does not
vary with $t$, the memoryless property of the exponential. It follows from (17.5) that
$S(t) = \exp(-\int_0^t \gamma \, du) = \exp(-\gamma t)$. The density is $f(t) = -S'(t) = \gamma \exp(-\gamma t)$, and
the cumulative hazard $\Lambda(t) = -\ln S(t) = \gamma t$ is linear in $t$.

The exponential is a one-parameter distribution that is too restrictive in practice. A
generalization commonly used in econometrics is the Weibull distribution. Table 17.4
presents the density and other distributional functions and moments for the Weibull and
the exponential, which is the special case $\alpha = 1$. The function $\Gamma(\cdot)$ given in
Table 17.4 is the gamma function.
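
Table 17.5. Standard Parametric Models and Their Hazard and Survivor Functions$^a$

Parametric Model       Hazard Function                                                                                      Survivor Function                              Type
Exponential            $\gamma$                                                                                             $\exp(-\gamma t)$                              PH, AFT
Weibull                $\gamma \alpha t^{\alpha-1}$                                                                         $\exp(-\gamma t^{\alpha})$                     PH, AFT
Generalized Weibull    $\gamma \alpha t^{\alpha-1} S(t)^{-\mu}$                                                             $[1 - \mu \gamma t^{\alpha}]^{1/\mu}$          PH
Gompertz               $\gamma \exp(\alpha t)$                                                                              $\exp(-(\gamma/\alpha)(e^{\alpha t} - 1))$     PH
Log-normal             $\dfrac{\exp(-(\ln t - \mu)^2 / 2\sigma^2)}{t \sigma \sqrt{2\pi} [1 - \Phi((\ln t - \mu)/\sigma)]}$  $1 - \Phi((\ln t - \mu)/\sigma)$               AFT
Log-logistic           $\alpha \gamma^{\alpha} t^{\alpha-1} / [1 + (\gamma t)^{\alpha}]$                                    $1 / [1 + (\gamma t)^{\alpha}]$                AFT
Gamma                  $\dfrac{\gamma (\gamma t)^{\alpha-1} \exp[-(\gamma t)]}{\Gamma(\alpha) [1 - I(\alpha, \gamma t)]}$   $1 - I(\alpha, \gamma t)$                      AFT

$^a$ All the parameters are restricted to be positive, except that $-\infty < \alpha < \infty$ for the Gompertz model.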


The Weibull has hazard $\lambda(t) = \gamma \alpha t^{\alpha-1}$, which is monotonically increasing if $\alpha > 1$
and monotonically decreasing if $\alpha < 1$. This is a special case of the proportional
hazards (PH) family, see Section 17.7.1, in which $\lambda(t)$ factors into a baseline component
that depends only on $t$, $\lambda_0(t)$, and a second term (e.g., $\gamma$) that can be
parameterized as a function of covariates only. Figure 17.2 presents properties of the
Weibull distribution with $\gamma = 0.01$ and $\alpha = 1.5$. The density is right-skewed, as is
usually the case with duration data. The shape of the survivor curve is one common
to many different distributions, making visual comparison of different estimated
survivor curves difficult. The hazard is increasing for this Weibull example, since
$\alpha > 1$. Other parametric models can have quite different shaped hazard functions,
including monotonically increasing, monotonically decreasing, U-shaped and inverse
U-shaped.

Figure 17.2: Weibull distribution: density, survivor, hazard and cumulative hazard functions
plotted against time for $\gamma = 0.01$ and $\alpha = 1.5$.

The hazard function is often imprecisely estimated in practice, especially in the
right tail. The cumulative hazard $\Lambda(t)$ is more precisely estimated and permits some
discrimination across models. Even better is $\ln \Lambda(t)$ plotted against $\ln t$, since for the
Weibull model $\ln \Lambda(t) = \ln \gamma + \alpha \ln t$ is linear in $\ln t$ with slope $\alpha$.
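
The log-log linearity is easy to confirm numerically. The sketch below evaluates the Weibull functions of Table 17.4 at the Figure 17.2 parameter values and computes the slope of $\ln \Lambda(t)$ against $\ln t$ between two arbitrarily chosen time points; the function name `weibull` and the two time points are illustrative assumptions.

```python
import math

GAMMA, ALPHA = 0.01, 1.5   # parameter values used for Figure 17.2

def weibull(t, g=GAMMA, a=ALPHA):
    """Density, survivor, hazard and cumulative hazard of the Weibull at time t."""
    surv = math.exp(-g * t ** a)
    haz = g * a * t ** (a - 1)
    return {"f": haz * surv, "S": surv, "hazard": haz, "cum_hazard": g * t ** a}

# Slope of ln(cumulative hazard) against ln(t) between two time points
t1, t2 = 5.0, 40.0
slope = ((math.log(weibull(t2)["cum_hazard"]) - math.log(weibull(t1)["cum_hazard"]))
         / (math.log(t2) - math.log(t1)))
print("slope of ln Lambda(t) on ln t:", slope)
print("hazard at t=5 and t=40:", weibull(t1)["hazard"], weibull(t2)["hazard"])
```

With estimated rather than true cumulative hazards the points will only be approximately collinear, so in practice the plot is a diagnostic rather than a test.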


17.6.2. Some Parametric Models 


Popular choices for parametric models include the exponential, Weibull, Gompertz, 
log-normal, log-logistic, and the gamma. The hazard and survivor functions for these 
models are in Table 17.5. 

For the gamma, $\Gamma(\alpha) = \int_0^{\infty} e^{-t} t^{\alpha-1} dt$ is the gamma function and $I(\alpha, \gamma t)$ is the
incomplete gamma function, where $I(\alpha, x) = \int_0^x e^{-t} t^{\alpha-1} dt / \Gamma(\alpha)$, $0 < I(\alpha, x) < 1$.

The generalized Weibull model was suggested by Mudholkar, Srivastava, and Kollia
(1996). Through the introduction of an additional shape parameter $\mu$ in the Weibull, it
overcomes an important restriction of that model and allows the hazard function to
have a more flexible shape. The Weibull model is obtained in the limit as $\mu \to 0$.
From Table 17.5 note that

$$\ln \lambda(t) = \ln(\gamma \alpha) + (\alpha - 1) \ln t - \mu \ln S(t).$$

Because $\partial \ln S(t) / \partial t < 0$, the right-hand side of this equation is increasing in $t$ if
$\mu > 0$ and $\alpha > 1$. If $\alpha < 1$ and $\mu < 0$, then the hazard function is monotonically
decreasing. If $\alpha > 1$ and $\mu < 0$, then the hazard function has two components, one of
which is a decreasing function and the other an increasing function in $t$. Hence the
two together can generate a unimodal or U-shaped hazard function. Therefore, the
generalized Weibull is a potentially flexible and useful functional form.
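
A small numerical sketch illustrates the nonmonotone case. It assumes the generalized Weibull survivor function $S(t) = [1 - \mu \gamma t^{\alpha}]^{1/\mu}$ and hazard $\gamma \alpha t^{\alpha-1} S(t)^{-\mu}$, which is the reading of Table 17.5 implied by the equation for $\ln \lambda(t)$ above; the function name and parameter values are illustrative only.

```python
import math

def gen_weibull_hazard(t, g, a, mu):
    """Hazard g*a*t^(a-1) * S(t)^(-mu), with S(t) = [1 - mu*g*t^a]^(1/mu)."""
    base = 1.0 - mu * g * t ** a
    if base <= 0:
        raise ValueError("t lies outside the support implied by mu > 0")
    surv = base ** (1.0 / mu)
    return g * a * t ** (a - 1) * surv ** (-mu)

# alpha > 1 and mu < 0: the hazard first rises and then falls
for t in (2.0, 5.0, 10.0, 20.0, 40.0):
    print(t, round(gen_weibull_hazard(t, g=0.01, a=2.0, mu=-1.0), 4))

# mu close to zero: the hazard is close to the Weibull hazard g*a*t^(a-1)
print(gen_weibull_hazard(10.0, g=0.01, a=2.0, mu=1e-6), 0.01 * 2.0 * 10.0)
```

For these particular values the hazard is $0.02t/(1 + 0.01t^2)$, which peaks at $t = 10$.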

The Gompertz is similar to the Weibull as it has a hazard function that can be
monotonically increasing (if $\alpha > 0$) or monotonically decreasing (if $\alpha < 0$), with the
exponential as a special case ($\alpha = 0$). The Gompertz is a good model for mortality data and
is used more in biostatistics than econometrics.

The log-normal distribution has an inverted bathtub hazard that first increases with
$t$ and then decreases with $t$. So too does the log-logistic, for $\alpha > 1$. These models are
clearly more appropriate than exponential, Weibull, and Gompertz for duration data
with this property.

Other parametric models include models based on the Rayleigh and Makeham dis- 
tributions, inverse-Gaussian piecewise continuous hazards model, and the generalized 
gamma model (Lawless, 1982), which nests the gamma and Weibull models as spe- 
cial cases. Many parametric models are presented in detail in Kalbfleisch and Prentice 
(1980, chapter 3) and Lancaster (1990, chapter 3). 

The distributions are generally two-parameter distributions. Regressors are introduced
by letting $\gamma = \exp(\mathbf{x}'\boldsymbol{\beta})$ with $\alpha$ left as a constant, but for the log-normal $\mu = \mathbf{x}'\boldsymbol{\beta}$
and $\sigma^2$ is left as a constant.

The main issues in parametric modeling are the dependence on correct model specification
for consistent parameter estimates and the wide range of parametric models
that are available. Most models can be classified as either a PH model (the first four 
in Table 17.5) or an accelerated failure time model (the first two and the last three 
models in Table 17.5). The Weibull model, a member of both classes, is widely used 
in economics applications. Another widely used model, particularly for economics ap- 
plications in which many observations are available, is the piecewise constant hazard 
model, which is a special case of the PH model. 


17.6.3. Maximum Likelihood Estimation 


We now consider fully parametric analysis with independent or noninformative cen- 
soring, with estimation by ML and by least squares. The continuous duration formu- 
lation is used since parametric models are based on continuous distributions. The re- 
gressors are assumed to be time-invariant, with time-varying regressors deferred to 
Section 17.9. 

Let $T^*$ denote durations without censoring, with conditional density $f(t|\mathbf{x}, \boldsymbol{\theta})$,
where $\boldsymbol{\theta}$ is a $q \times 1$ parameter vector and $\mathbf{x}$ are regressors that can vary across
subjects but do not vary over a spell for a given subject. Estimation is complicated by
the presence of censoring. Then the observed duration $t$ is the length of a possibly
incomplete spell, and the data are augmented by a variable indicating the presence of
censoring, which is assumed to be noninformative.

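From Section 17.4.2, the treatment is similar to that for the Tobit model. For uncensored
observations the contribution to the likelihood is $f(t|\mathbf{x}, \boldsymbol{\theta})$. For right-censored
observations we know only that the duration exceeded $t$, so the contribution is

$$\Pr[T^* > t] = \int_t^{\infty} f(u|\mathbf{x}, \boldsymbol{\theta}) \, du = 1 - F(t|\mathbf{x}, \boldsymbol{\theta}) = S(t|\mathbf{x}, \boldsymbol{\theta}),$$

where $S(\cdot)$ is the survivor function. The density for the $i$th observation can be written
as

$$f(t_i|\mathbf{x}_i, \boldsymbol{\theta})^{\delta_i} S(t_i|\mathbf{x}_i, \boldsymbol{\theta})^{1-\delta_i},$$

where $\delta_i$ is a right-censoring indicator with

$$\delta_i = \begin{cases} 1 & \text{(no censoring)}, \\ 0 & \text{(right-censoring)}. \end{cases}$$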


Taking logs and summing, we have that the MLE $\hat{\boldsymbol{\theta}}$ maximizes the log-likelihood

$$\ln L(\boldsymbol{\theta}) = \sum_{i=1}^{N} \left[ \delta_i \ln f(t_i|\mathbf{x}_i, \boldsymbol{\theta}) + (1 - \delta_i) \ln S(t_i|\mathbf{x}_i, \boldsymbol{\theta}) \right], \tag{17.17}$$

where independence over $i$ has been assumed. The first term in the sum corresponds
to completed spells and the second term to right-censored spells. Since $\ln S(t) = -\Lambda(t)$
and $\ln f(t) = \ln(\lambda(t)S(t)) = \ln \lambda(t) + \ln S(t)$, this log-likelihood can alternatively be
written in terms of the hazard and integrated hazard functions:

$$\ln L(\boldsymbol{\theta}) = \sum_{i=1}^{N} \left[ \delta_i \ln \lambda(t_i|\mathbf{x}_i, \boldsymbol{\theta}) - \Lambda(t_i|\mathbf{x}_i, \boldsymbol{\theta}) \right]. \tag{17.18}$$

This result is useful if the parametric model is defined by specifying the hazard rate
rather than the pdf.

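The usual estimation theory applies. The MLE will be distributed as
$\hat{\boldsymbol{\theta}} \overset{a}{\sim} N[\boldsymbol{\theta}, (-E[\partial^2 \ln L / \partial \boldsymbol{\theta} \partial \boldsymbol{\theta}'])^{-1}]$ if the density is correctly specified, see Section 5.7.3.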
If the density is incorrectly specified, however, the MLE is inconsistent. The one no- 
table exception is the exponential duration model in the absence of censoring, for 
which consistency requires only that the conditional mean function be correctly spec- 
ified; see Section 5.7.3. However, inconsistency under misspecification arises even for 
the exponential model if censoring is introduced, and it arises for other parametric du- 
ration models even without censoring. This lack of robustness is the major weakness 
of the parametric approach, just as in the Tobit model case. 

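The ML approach can be adapted to permit other types of censoring. With left-censoring,
the spell is known to be of length at most $t$, and the likelihood contribution
is $\Pr[T^* < t] = \int_0^t f(s|\mathbf{x}, \boldsymbol{\theta}) \, ds = F(t|\mathbf{x}, \boldsymbol{\theta})$.

With interval-censoring the data are known to lie in $[t_a, t_b)$ and the likelihood
contribution is $\Pr[t_a \le T^* < t_b] = \int_{t_a}^{t_b} f(s|\mathbf{x}, \boldsymbol{\theta}) \, ds = S(t_a|\mathbf{x}, \boldsymbol{\theta}) - S(t_b|\mathbf{x}, \boldsymbol{\theta})$.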

Duration data in economics applications are often interval-censored. For example, 
unemployment durations may be grouped into weeks and months, yet the parametric 
model is a continuous distribution such as the Weibull. It is usually assumed that the 
effect of interval-censoring is sufficiently minor so that the interval-censoring can be 
ignored. For example, a person who is unemployed after two months but no longer 
unemployed after three months may be treated as having an unemployment spell of 
exactly three months, rather than a spell in the range of two to three months. 


17.6.4. Components of Likelihood 


Given a mix of data, with durations that may be complete, truncated, or censored in 
one of the aforementioned ways, maximum likelihood of a parametrically specified 
model requires one to set up the likelihood function. (Lancaster (1979) displays dif- 
ferent likelihood expressions appropriate for three different data setups for unemploy- 
ment durations.) Each type of observation contributes a term to the likelihood function, 
and the full likelihood is formed by taking appropriate products of terms such as the 
following (see Klein and Moeschberger, 1997, p. 66): 


complete durations: $f(t)$,
left-truncated at $t_L$ ($t > t_L$): $f(t)/S(t_L)$,
left-censored at $t_{C_L}$: $1 - S(t_{C_L})$,
right-censored at $t_{C_R}$: $S(t_{C_R})$,
right-truncated at $t_R$ ($t < t_R$): $f(t)/[1 - S(t_R)]$,
interval-censored at $t_{C_L}, t_{C_R}$: $S(t_{C_L}) - S(t_{C_R})$.
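
To make the bookkeeping concrete, the following sketch returns the likelihood contribution for each observation type, using a Weibull density and survivor function purely as an example. The function name `contribution`, the string labels, and the argument names `lo` and `hi` (lower and upper truncation or censoring points) are hypothetical.

```python
import math

def S(t, g=0.01, a=1.5):
    """Weibull survivor function from Table 17.4 (illustrative parameter values)."""
    return math.exp(-g * t ** a)

def f(t, g=0.01, a=1.5):
    """Weibull density, hazard times survivor function."""
    return g * a * t ** (a - 1) * S(t, g, a)

def contribution(kind, t=None, lo=None, hi=None):
    """Likelihood contribution of one observation of the given type."""
    if kind == "complete":
        return f(t)
    if kind == "left_truncated":        # spell observed only because t > lo
        return f(t) / S(lo)
    if kind == "left_censored":         # known only that the duration is below hi
        return 1.0 - S(hi)
    if kind == "right_censored":        # known only that the duration exceeds lo
        return S(lo)
    if kind == "right_truncated":       # spell observed only because t < hi
        return f(t) / (1.0 - S(hi))
    if kind == "interval_censored":     # duration known to lie between lo and hi
        return S(lo) - S(hi)
    raise ValueError(f"unknown observation type: {kind}")

sample = [("complete", dict(t=12.0)),
          ("right_censored", dict(lo=30.0)),
          ("interval_censored", dict(lo=8.0, hi=12.0)),
          ("left_truncated", dict(t=12.0, lo=5.0))]
log_lik = sum(math.log(contribution(kind, **args)) for kind, args in sample)
print("log-likelihood of the mixed sample:", log_lik)
```

The full log-likelihood is the sum of the logs of such terms, with the parameters of `S` and `f` then chosen to maximize it.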
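

17.6.5. Weibull MLE Example


The Weibull distribution is presented in detail in Section 17.6.1. The hazard function
is $\lambda(t) = \gamma \alpha t^{\alpha-1}$, where $\alpha > 0$ and $\gamma > 0$.

Regressors can be introduced in many possible ways, but the usual specification
is to let $\gamma = \exp(\mathbf{x}'\boldsymbol{\beta})$, which ensures $\gamma > 0$, while $\alpha$ does not vary with regressors.
(Some programs instead specify $\gamma = \exp(-\mathbf{x}'\boldsymbol{\beta})$, which leads to a reversal in the signs
of the estimates of $\boldsymbol{\beta}$.) Then

$$\begin{aligned}
\ln f(t|\mathbf{x}, \boldsymbol{\beta}, \alpha) &= \ln \left[ \exp(\mathbf{x}'\boldsymbol{\beta}) \alpha t^{\alpha-1} \exp(-\exp(\mathbf{x}'\boldsymbol{\beta}) t^{\alpha}) \right] \\
&= \mathbf{x}'\boldsymbol{\beta} + \ln \alpha + (\alpha - 1) \ln t - \exp(\mathbf{x}'\boldsymbol{\beta}) t^{\alpha}
\end{aligned}$$

and

$$\begin{aligned}
\ln S(t|\mathbf{x}, \boldsymbol{\beta}, \alpha) &= \ln \left[ \exp(-\exp(\mathbf{x}'\boldsymbol{\beta}) t^{\alpha}) \right] \\
&= -\exp(\mathbf{x}'\boldsymbol{\beta}) t^{\alpha}.
\end{aligned}$$

The log-likelihood function (17.17) becomes

$$\ln L = \sum_i \left[ \delta_i \{ \mathbf{x}_i'\boldsymbol{\beta} + \ln \alpha + (\alpha - 1) \ln t_i - \exp(\mathbf{x}_i'\boldsymbol{\beta}) t_i^{\alpha} \} - (1 - \delta_i) \exp(\mathbf{x}_i'\boldsymbol{\beta}) t_i^{\alpha} \right]. \tag{17.19}$$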

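The first-order conditions for $\boldsymbol{\beta}$ and $\alpha$ are

$$\frac{\partial \ln L}{\partial \boldsymbol{\beta}} = \sum_i \left( \delta_i - \exp(\mathbf{x}_i'\boldsymbol{\beta}) t_i^{\alpha} \right) \mathbf{x}_i = \mathbf{0},$$

$$\frac{\partial \ln L}{\partial \alpha} = \sum_i \left[ \delta_i (1/\alpha + \ln t_i) - \ln t_i \exp(\mathbf{x}_i'\boldsymbol{\beta}) t_i^{\alpha} \right] = 0.$$

Consistency clearly requires strong assumptions. For example, even with no censoring
$E[\partial \ln L / \partial \boldsymbol{\beta}] = \mathbf{0}$ requires $E[T^{\alpha}|\mathbf{x}] = \exp(-\mathbf{x}'\boldsymbol{\beta})$.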
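
The log-likelihood (17.19) and its first-order conditions can be coded directly. The sketch below uses only the Python standard library, simulates right-censored Weibull data under assumed parameter values, and compares the analytical score with a numerical derivative. All function names, the simulated design (an intercept and one uniform regressor), and the fixed censoring point are our own assumptions; in (17.19) the two terms in $\exp(\mathbf{x}_i'\boldsymbol{\beta}) t_i^{\alpha}$ combine to a single term, which the code exploits.

```python
import math
import random

def xb(beta, x):
    return sum(b * v for b, v in zip(beta, x))

def loglik(beta, alpha, data):
    """Log-likelihood (17.19) for observations (t, delta, x)."""
    total = 0.0
    for t, delta, x in data:
        e = math.exp(xb(beta, x)) * t ** alpha      # integrated hazard at t
        total += delta * (xb(beta, x) + math.log(alpha)
                          + (alpha - 1) * math.log(t)) - e
    return total

def score(beta, alpha, data):
    """Analytical derivatives of (17.19) with respect to beta and alpha."""
    g_beta, g_alpha = [0.0] * len(beta), 0.0
    for t, delta, x in data:
        e = math.exp(xb(beta, x)) * t ** alpha
        for k, v in enumerate(x):
            g_beta[k] += (delta - e) * v
        g_alpha += delta * (1 / alpha + math.log(t)) - math.log(t) * e
    return g_beta, g_alpha

def simulate(n, beta, alpha, censor_at, seed=0):
    """Weibull durations with S(t) = exp(-exp(x'beta) t^alpha), censored at censor_at."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [1.0, rng.random()]
        t = (rng.expovariate(1.0) / math.exp(xb(beta, x))) ** (1 / alpha)
        data.append((min(t, censor_at), 1 if t <= censor_at else 0, x))
    return data

data = simulate(200, beta=[-1.0, 0.5], alpha=1.5, censor_at=3.0)
g_beta, g_alpha = score([-0.8, 0.3], 1.2, data)
h = 1e-6
num_alpha = (loglik([-0.8, 0.3], 1.2 + h, data)
             - loglik([-0.8, 0.3], 1.2 - h, data)) / (2 * h)
print("analytical and numerical d lnL / d alpha:", g_alpha, num_alpha)
```

A full estimator would pass `loglik` and `score` to a numerical optimizer; the comparison here only checks that the derivatives agree with the log-likelihood.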


17.6.6. Use of Model Estimates 


The usual way to interpret estimates of nonlinear regression models is to consider the
effect of regressors on the conditional mean. If $\gamma = \exp(\mathbf{x}'\boldsymbol{\beta})$ then from Table 17.4
the completed Weibull durations have mean $E[T^*|\mathbf{x}] = \exp(-\mathbf{x}'\boldsymbol{\beta}/\alpha) \Gamma(\alpha^{-1} + 1) =
\exp(-\mathbf{x}'\boldsymbol{\beta}/\alpha) \Gamma(\alpha^{-1})/\alpha$. One can calculate the expected length of completed spells at
various values of $\mathbf{x}$. For example, the length of completed unemployment for a person
of given age, gender, and education level, say, can be predicted postestimation.
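
As an illustration of such a postestimation calculation, the sketch below evaluates the Weibull conditional mean for a hypothetical index value $\mathbf{x}'\boldsymbol{\beta}$ and shape $\alpha$, and checks the closed form against a numerical integral of the survivor function (the mean of a nonnegative duration equals the integral of $S(t)$). The numbers and function names are assumptions made for the example only.

```python
import math

def weibull_mean(xb, alpha):
    """E[T*|x] = exp(-x'beta/alpha) * Gamma(1/alpha + 1) when gamma = exp(x'beta)."""
    return math.exp(-xb / alpha) * math.gamma(1.0 / alpha + 1.0)

def mean_by_integration(xb, alpha, upper=300.0, steps=100_000):
    """Trapezoid approximation to the integral of S(t) over [0, upper]."""
    h = upper / steps

    def surv(t):
        return math.exp(-math.exp(xb) * t ** alpha)

    inner = sum(surv(k * h) for k in range(1, steps))
    return h * (0.5 * surv(0.0) + inner + 0.5 * surv(upper))

xb_hat, alpha_hat = -4.0, 1.5     # hypothetical estimates for one individual
closed_form = weibull_mean(xb_hat, alpha_hat)
numerical = mean_by_integration(xb_hat, alpha_hat)
print("closed form:", closed_form, " numerical:", numerical)
```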

Parametric regression models also permit prediction of aspects of durations other 
than just the sample mean. For example, interest may lie in what fraction of population 
total time in completed unemployment spells is due to spells in excess of a given length 
or is experienced by individuals in a given socioeconomic group. The econometrics of 
duration models focuses on the role of covariates but it is especially concerned with the 
shape of the hazard function, notably because some economic theories make explicit 
predictions about the shape of the hazard function. 

Despite these possibilities, interpretation of estimates of parametric duration models
often focuses on the Weibull hazard rate $\lambda(t) = \gamma \alpha t^{\alpha-1}$ and how it changes over
time and with changes in regressors. As noted in Section 17.3.2, this hazard rate is
increasing if $\alpha > 1$ and is decreasing if $\alpha < 1$ so that one-sided tests of $\alpha = 1$ are
obviously of interest. For changes in regressors

$$\partial \lambda(t) / \partial \mathbf{x} = \exp(\mathbf{x}'\boldsymbol{\beta}) \alpha t^{\alpha-1} \boldsymbol{\beta} = \lambda(t) \boldsymbol{\beta},$$

so that changes in regressors have the effect of a multiplicative change in the hazard
function. A positive coefficient $\beta_j$ therefore implies an increase in the hazard rate as a
component of $\mathbf{x}$ increases. Thus if $\beta_j > 0$ an increase in $x_j$ leads to an increase in the
hazard of failure and hence to a decrease in the expected duration.


17.6.7. Least-Squares Estimation 


Estimation of fully parametric models can be by least squares rather than MLE, simi- 
lar to the censored Tobit model. We present results, although least-squares regression 
sees little use in practice because the methods still rely on correct specification of the 
density and yet are less efficient than the MLE. 

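We begin with the exponential duration regression model. Then $E[T|\mathbf{x}] = 1/\gamma =
\exp(-\mathbf{x}'\boldsymbol{\beta})$, so that NLS regression of $t_i$ on $\exp(-\mathbf{x}_i'\boldsymbol{\beta})$ gives a consistent though
inefficient estimator for $\boldsymbol{\beta}$. Alternatively, since $\gamma T$ is unit exponential distributed, the
exponential duration model can be written as $\ln t = -\mathbf{x}'\boldsymbol{\beta} + u$, where $u = \ln(\gamma t)$ is
extreme value distributed (see Section 17.7.2). Then $E[\ln T|\mathbf{x}] = -\mathbf{x}'\boldsymbol{\beta} - c$, where
$c \simeq 0.5772$ is Euler's constant. So $\boldsymbol{\beta}$ can be consistently estimated by linear regression
of $-\ln t_i$ on $\mathbf{x}_i$, with the intercept shifted by $c$. With right-censoring we need to obtain
analytical censored moments, which is possible for the exponential.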
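
A small simulation, sketched below under assumed parameter values, illustrates the log-linear regression for uncensored exponential durations with $\gamma = \exp(\beta_0 + \beta_1 x)$. Because $\gamma T$ is unit exponential and the mean of the log of a unit exponential is minus Euler's constant, regressing $-\ln t$ on $x$ should recover a slope near $\beta_1$ and an intercept near $\beta_0 + 0.5772$. The function name, sample size, and seed are arbitrary choices.

```python
import math
import random

EULER = 0.5772156649

def simulate_and_regress(n=20000, b0=0.5, b1=1.0, seed=1):
    """OLS of -ln(t) on x for exponential durations with rate exp(b0 + b1*x)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.random()
        rate = math.exp(b0 + b1 * x)       # gamma = exp(x'beta)
        t = rng.expovariate(rate)          # exponential duration with mean 1/gamma
        xs.append(x)
        ys.append(-math.log(t))
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return intercept, slope

intercept, slope = simulate_and_regress()
print("slope:", slope, "(true value 1.0)")
print("intercept:", intercept, "(true value 0.5 + Euler's constant)")
```

The estimates are noisy because the error has variance $\pi^2/6$, which is one way to see the efficiency loss relative to the MLE.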

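Extensions can be made using the more general results of Kiefer (1988, p. 665). He
considers the PH model (17.21) with $\phi(\mathbf{x}, \boldsymbol{\beta}) = \exp(\mathbf{x}'\boldsymbol{\beta})$. Then

$$\lambda(t|\mathbf{x}) = \lambda_0(t, \boldsymbol{\alpha}) \exp(\mathbf{x}'\boldsymbol{\beta}).$$

Then an expression for the baseline integrated hazard can be derived as follows:

$$\begin{aligned}
\int_0^t \lambda(s|\mathbf{x}) \, ds &= \int_0^t \lambda_0(s, \boldsymbol{\alpha}) \exp(\mathbf{x}'\boldsymbol{\beta}) \, ds, \\
\Lambda(t|\mathbf{x}) &= \Lambda_0(t, \boldsymbol{\alpha}) \exp(\mathbf{x}'\boldsymbol{\beta}), \\
\ln \Lambda(t|\mathbf{x}) &= \ln \Lambda_0(t, \boldsymbol{\alpha}) + \mathbf{x}'\boldsymbol{\beta}, \\
-\ln \Lambda_0(t, \boldsymbol{\alpha}) &= \mathbf{x}'\boldsymbol{\beta} - \ln \Lambda(t|\mathbf{x}) \\
&= \mathbf{x}'\boldsymbol{\beta} + u,
\end{aligned} \tag{17.20}$$

where the error term $u = -\ln \Lambda(t|\mathbf{x})$ is type 1 extreme value distributed.

This result holds regardless of the choice of baseline hazard. We interpret this result
in the following way. For a particular choice of baseline hazard $\lambda_0(t, \boldsymbol{\alpha})$, a convenient
transformation of the dependent variable $t$ is $-\ln \Lambda_0(t, \boldsymbol{\alpha})$, since it can be expressed
as a linear regression model with error term that is type 1 extreme value distributed.

For the exponential, already discussed, $\ln \Lambda_0(t, \boldsymbol{\alpha}) = \ln t$ whereas for the Weibull
$\ln \Lambda_0(t, \boldsymbol{\alpha}) = \alpha \ln t$. In censored samples we obtain $E[\ln \Lambda_0(T, \boldsymbol{\alpha}) | T > t^*]$ using
results for the censored type 1 extreme value, and then follow a Heckman two-step
procedure. These results can also be used as the basis for simple diagnostics; this topic
is discussed in the next chapter.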


17.7. Some Important Duration Models 


Perhaps the most widely used formulation in regression analysis of durations is
the proportional hazards model. However, familiarity with some of its variants and with
the accelerated failure time (AFT) models, discussed in Section 17.7.2, is also helpful.


17.7.1. Proportional Hazards Model 


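In a proportional hazards model, as previously mentioned, the conditional hazard rate
$\lambda(t|\mathbf{x})$ can be factored into separate functions of $t$ and $\mathbf{x}$,

$$\lambda(t|\mathbf{x}) = \lambda_0(t, \boldsymbol{\alpha}) \phi(\mathbf{x}, \boldsymbol{\beta}), \tag{17.21}$$

where $\lambda_0(t, \boldsymbol{\alpha})$ is called the baseline hazard and is a function of $t$ alone, and $\phi(\mathbf{x}, \boldsymbol{\beta})$
is a function of $\mathbf{x}$ alone. Usually $\phi(\mathbf{x}, \boldsymbol{\beta}) = \exp(\mathbf{x}'\boldsymbol{\beta})$. Polynomial baseline hazards
are popular in the literature.

All hazard functions $\lambda(t|\mathbf{x})$ of form (17.21) are proportional to the baseline hazard,
with scale factor $\phi(\mathbf{x}, \boldsymbol{\beta})$ that is not an explicit function of $t$. The PH model is widely
used as the parameters $\boldsymbol{\beta}$ can be consistently estimated without specification of the
functional form for $\lambda_0(\cdot)$ (see Section 17.8).

The exponential, Weibull, and Gompertz regression models are all PH models, since
their hazards are, respectively, $\exp(\mathbf{x}'\boldsymbol{\beta})$, $\exp(\mathbf{x}'\boldsymbol{\beta}) \alpha t^{\alpha-1}$, and $\exp(\mathbf{x}'\boldsymbol{\beta}) \exp(\alpha t)$.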

Another example of the PH model, used especially in applications to unemployment
durations, is the piecewise constant hazard model, which lets $\lambda_0(t, \boldsymbol{\alpha})$ be a step
function with $k$ segments so that

$$\lambda_0(t, \boldsymbol{\alpha}) = e^{\alpha_j}, \quad c_{j-1} \le t < c_j, \quad j = 1, \ldots, k, \tag{17.22}$$

where $c_0 = 0$, $c_k = \infty$, the other breakpoints $c_1, \ldots, c_{k-1}$ are specified, and the
parameters $\alpha_1, \ldots, \alpha_k$ are to be estimated. These parameters are exponentiated to ensure
$\lambda_0(t, \boldsymbol{\alpha}) > 0$. This model has more baseline parameters to estimate than models such
as the Weibull, which has only one baseline hazard parameter, but can still be practical
with a sufficiently large data set.
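
For likelihood estimation one needs the integrated baseline hazard implied by (17.22), which is piecewise linear in $t$. A minimal sketch follows; the function names, the breakpoints, and the segment hazards are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import math

def pch_cumulative_hazard(t, breaks, alphas):
    """Integrated baseline hazard for (17.22).

    breaks -- specified breakpoints [c_1, ..., c_{k-1}]
    alphas -- parameters [alpha_1, ..., alpha_k]; segment j has hazard exp(alpha_j)
    """
    edges = [0.0] + list(breaks) + [math.inf]
    total = 0.0
    for j, a in enumerate(alphas):
        lo, hi = edges[j], edges[j + 1]
        if t <= lo:
            break
        total += math.exp(a) * (min(t, hi) - lo)   # hazard times time spent in segment
    return total

def pch_survivor(t, xb, breaks, alphas):
    """Survivor function of the PH model with phi(x, beta) = exp(x'beta)."""
    return math.exp(-pch_cumulative_hazard(t, breaks, alphas) * math.exp(xb))

breaks = [4.0, 8.0]
alphas = [math.log(0.1), math.log(0.2), math.log(0.05)]
print(pch_cumulative_hazard(10.0, breaks, alphas))   # 0.1*4 + 0.2*4 + 0.05*2
print(pch_survivor(10.0, xb=0.0, breaks=breaks, alphas=alphas))
```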

The identifiability of the PH model in the presence of unobserved heterogeneity is 
discussed in Section 18.3. 


17.7.2. Accelerated Failure Time Model 


An AFT model arises by first modeling $\ln t$ rather than $t$. A regression model is
specified for

$$\ln t = \mathbf{x}'\boldsymbol{\beta} + u, \tag{17.23}$$

and different distributions for $u$ lead to different AFT models. Since $\ln t$ can take values
on $(-\infty, \infty)$ the distribution for $u$ can be any continuous distribution on $(-\infty, \infty)$.

The term accelerated failure time arises because $t = \exp(\mathbf{x}'\boldsymbol{\beta}) v$, where $v = e^u$ has
baseline hazard $\lambda_0(v)$ that does not depend on $\mathbf{x}$. Because $\Pr[T > t|\mathbf{x}] = \Pr[v > t \exp(-\mathbf{x}'\boldsymbol{\beta})]$,
substituting $v = t \exp(-\mathbf{x}'\boldsymbol{\beta})$ yields the hazard

$$\lambda(t|\mathbf{x}) = \lambda_0(t \exp(-\mathbf{x}'\boldsymbol{\beta})) \exp(-\mathbf{x}'\boldsymbol{\beta}). \tag{17.24}$$

This is an acceleration of the baseline hazard $\lambda_0(t)$ if $\exp(-\mathbf{x}'\boldsymbol{\beta}) > 1$ and a deceleration
if $\exp(-\mathbf{x}'\boldsymbol{\beta}) < 1$.

The log-normal model for $t$ results if $u \sim N[0, \sigma^2]$; the log-logistic model is
obtained by specifying $u$ to be logistic distributed. The gamma model can also be
obtained as an AFT model, by letting $u$ have density $f(u) = \exp(\alpha u - e^u) / \Gamma(\alpha)$.

The Weibull and exponential models are unique in being of both PH form and AFT
form. The latter form is obtained by letting $u$ be $\alpha^{-1} w$, where $w$ is extreme value
distributed with density $f(w) = e^w \exp(-e^w)$.

Additional duration models can be obtained by considering $g(t) = \mathbf{x}'\boldsymbol{\beta} + u$, for
transformations other than $g(t) = \ln t$. This is a member of the class of transformation
models, which includes, for example, the Box–Cox regression model.
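
The dual PH and AFT nature of the Weibull can be checked numerically. The sketch below assumes the PH parameterization $\lambda(t|x) = \exp(x\beta)\alpha t^{\alpha-1}$ with a scalar regressor, and an AFT error $u = w/\alpha$ with $w$ the log of a unit exponential, so that $v = e^u$ has hazard $\alpha v^{\alpha-1}$ and the AFT coefficient is $-\beta/\alpha$. By a change of variables, $t = \exp(xb)v$ has hazard $\lambda_0(t e^{-xb}) e^{-xb}$. All names and numerical values are illustrative assumptions.

```python
import math

alpha, beta_ph, x = 1.5, 0.8, 0.6      # illustrative values only

def ph_hazard(t):
    """Weibull hazard in proportional hazards form."""
    return math.exp(beta_ph * x) * alpha * t ** (alpha - 1)

beta_aft = -beta_ph / alpha            # implied coefficient in ln t = x*beta_aft + u

def base_hazard(v):
    """Hazard of v = exp(u), where u = w / alpha and w = ln(unit exponential)."""
    return alpha * v ** (alpha - 1)

def aft_hazard(t):
    """Hazard of t = exp(x*beta_aft) * v obtained by a change of variables."""
    scale = math.exp(-beta_aft * x)
    return base_hazard(t * scale) * scale

for t in (0.5, 1.0, 4.0):
    print(t, ph_hazard(t), aft_hazard(t))
```

The two columns agree, and the sign reversal between the PH and AFT coefficients is the one noted in Section 17.6.5 for programs that specify $\gamma = \exp(-\mathbf{x}'\boldsymbol{\beta})$.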


17.7.3. Flexible Hazard Models 


Some models begin with specification of the hazard rate, rather than the pdf. For example,
the hazard may be specified to be quadratic in $t$, such as $\lambda(t) = \mathbf{x}'\boldsymbol{\beta} + \alpha_1 t + \alpha_2 t^2$.
This permits a U-shaped hazard function. The corresponding integrated hazard is
$\Lambda(t) = (\mathbf{x}'\boldsymbol{\beta}) t + (\alpha_1/2) t^2 + (\alpha_2/3) t^3$. Given $\lambda(t)$ and $\Lambda(t)$ we can directly form the
log-likelihood, using the earlier result.

The weaknesses of this approach are that negative values of $\lambda$ and $\Lambda$ may occur
and that the hazard rate may be defective as the corresponding pdf may not necessarily
integrate to unity.
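
A sketch of the quadratic hazard, its integrated hazard, and the resulting log-likelihood contribution $\delta \ln \lambda(t) - \Lambda(t)$ is given below. The function names and parameter values are hypothetical, and the explicit check for a nonpositive hazard reflects the weakness just noted rather than a remedy for it.

```python
import math

def quad_hazard(t, xb, a1, a2):
    return xb + a1 * t + a2 * t ** 2

def quad_cum_hazard(t, xb, a1, a2):
    return xb * t + (a1 / 2) * t ** 2 + (a2 / 3) * t ** 3

def loglik_contribution(t, delta, xb, a1, a2):
    """delta * ln(hazard) - integrated hazard for one observation."""
    lam = quad_hazard(t, xb, a1, a2)
    if lam <= 0:
        raise ValueError("hazard is not positive at these parameter values")
    return delta * math.log(lam) - quad_cum_hazard(t, xb, a1, a2)

xb, a1, a2 = 0.5, -0.1, 0.01           # gives a U-shaped hazard with minimum at t = 5
for t in (0.0, 5.0, 10.0):
    print(t, quad_hazard(t, xb, a1, a2))
print(loglik_contribution(3.0, 1, xb, a1, a2))
```

With, say, `xb = 0.2` the same `a1` and `a2` give a negative hazard near $t = 5$, and the function raises an error instead of returning a contribution.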


17.8. Cox PH Model 


Fully parametric models for single-spell duration data are relatively simple to estimate 
in the presence of censoring but produce inconsistent parameter estimates if any part of 
the parametric model is misspecified. One way of resolving this impasse is to choose 
parametric functional forms that are flexible and hence provide some protection against 
misspecification. Although this is a valid approach in principle, identification and es- 
timation of such flexible functional forms is not always straightforward. An example 
is the generalized gamma model, which many users find difficult to estimate. 
Fortunately, there is a semiparametric method that requires less than complete 
distributional specification. The method differs considerably from semiparametric 
methods proposed for Tobit models, where similar issues of model robustness under 
censoring arise, as it is based on a model for the hazard rate that has no meaningful 
physical interpretation in the Tobit case. In addition, unlike the Tobit case, the method 
is viewed as empirically so successful that it has become the standard method for
survival data. 


17.8.1. Proportional Hazards Model 


The starting point is to propose a particular functional form for the hazard rate, the
proportional hazards model, introduced in Section 17.7.1, with conditional hazard rate
$\lambda(t|\mathbf{x})$ factored into separate functions of $t$ and $\mathbf{x}$,

$$\lambda(t|\mathbf{x}, \boldsymbol{\beta}) = \lambda_0(t) \phi(\mathbf{x}, \boldsymbol{\beta}). \tag{17.25}$$

As before, the function $\lambda_0(t)$ is called the baseline hazard and is a function of $t$ alone.
The function $\phi(\mathbf{x}, \boldsymbol{\beta})$ is a function of $\mathbf{x}$ alone, where initially we consider time-invariant
regressors $\mathbf{x}$ but later relax this assumption. A semiparametric model is considered,
with the functional form for $\lambda_0(t)$ unspecified and the functional form for $\phi(\mathbf{x}, \boldsymbol{\beta})$ fully
specified.

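The most common choice of $\phi(\mathbf{x}, \boldsymbol{\beta})$ is the exponential form

$$\phi(\mathbf{x}, \boldsymbol{\beta}) = \exp(\mathbf{x}'\boldsymbol{\beta}). \tag{17.26}$$

This permits coefficients to be easily interpretable, in addition to ensuring $\phi(\mathbf{x}, \boldsymbol{\beta}) > 0$.
Suppose the $j$th regressor $x_j$ increases by one unit and other regressors are unchanged;
then

$$\begin{aligned}
\lambda(t|\mathbf{x}_{new}, \boldsymbol{\beta}) &= \lambda_0(t) \exp(\mathbf{x}'\boldsymbol{\beta} + \beta_j) \\
&= \exp(\beta_j) \lambda(t|\mathbf{x}, \boldsymbol{\beta}).
\end{aligned} \tag{17.27}$$

Thus the new hazard is $\exp(\beta_j)$ times the original hazard, and the change in the hazard
is $\exp(\beta_j) - 1$ times the original hazard. If one instead uses calculus methods, the
change in the hazard is $\beta_j$ times the original hazard, since

$$\partial \lambda(t|\mathbf{x}, \boldsymbol{\beta}) / \partial x_j = \lambda_0(t) \exp(\mathbf{x}'\boldsymbol{\beta}) \beta_j = \beta_j \lambda(t|\mathbf{x}, \boldsymbol{\beta}). \tag{17.28}$$

This is consistent with the noncalculus result as $\exp(\beta_j) \simeq 1 + \beta_j$ for small $\beta_j$.
Statistical packages often report estimates and associated confidence intervals for both
$\beta_j$ and $\exp(\beta_j)$.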
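
The difference between the exact proportionate change $\exp(\beta_j) - 1$ and the calculus approximation $\beta_j$ is negligible for small coefficients but not for large ones, as the following short sketch with made-up coefficient values shows. The function name is our own.

```python
import math

def hazard_change(beta_j):
    """Proportionate change in the hazard for a one-unit increase in x_j."""
    exact = math.exp(beta_j) - 1.0     # from the hazard ratio exp(beta_j)
    approx = beta_j                    # calculus result
    return exact, approx

for b in (0.05, 0.2, 0.7, -0.3):
    exact, approx = hazard_change(b)
    print(f"beta_j={b:+.2f}  hazard ratio={math.exp(b):.4f}  "
          f"exact change={exact:+.4f}  calculus={approx:+.4f}")
```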

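For more general forms of $\phi(\mathbf{x}, \boldsymbol{\beta})$, changes in regressors can again be interpreted
as having a multiplicative effect on the original hazard, since

$$\begin{aligned}
\partial \lambda(t|\mathbf{x}, \boldsymbol{\beta}) / \partial x_j &= \lambda_0(t) \, \partial \phi(\mathbf{x}, \boldsymbol{\beta}) / \partial x_j \\
&= \lambda(t|\mathbf{x}, \boldsymbol{\beta}) \times [\partial \phi(\mathbf{x}, \boldsymbol{\beta}) / \partial x_j] / \phi(\mathbf{x}, \boldsymbol{\beta}).
\end{aligned} \tag{17.29}$$

This requires knowledge of $\boldsymbol{\beta}$ but not of the baseline hazard $\lambda_0(t)$.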

An important issue is the identification of the PH model. This is discussed in the 
next chapter in a more general setting that allows for the presence of unobserved het- 
erogeneity in the model. 


17.8.2. Partial Likelihood Estimation


Cox (1972, 1975) proposed a method to estimate $\boldsymbol{\beta}$ in the PH model that does not
require simultaneous estimation of the baseline hazard function $\lambda_0(t)$. If desired an
estimate of the baseline hazard can be recovered after estimation of $\boldsymbol{\beta}$. The results
presented here accommodate independent censoring and tied data.

The setup resembles that in Section 17.5, with failure times ordered and categorization
of observations into those that die or are at risk at each failure time. Let
$t_1 < t_2 < \cdots < t_j < \cdots < t_k$ denote the observed discrete failure times of the spells
in a sample of size $N$, $N \ge k$. The risk set $R(t_j)$ is defined to be the set of individuals
who are at risk of failing just before the $j$th ordered failure, $D(t_j)$ is the set of subjects
that die at time $t_j$, and $d_j$ denotes the number that die at time $t_j$. To summarize, we
have

$$\begin{aligned}
R(t_j) &= \{ l : t_l \ge t_j \} = \text{set of spells at risk at } t_j, \\
D(t_j) &= \{ l : t_l = t_j \} = \text{set of spells completed at } t_j, \\
d_j &= \textstyle\sum_l \mathbf{1}(t_l = t_j) = \text{number of spells completed at } t_j.
\end{aligned} \tag{17.30}$$

The risk set at time $t_j$ includes all spells that are not yet completed or not yet censored.
Tied data are possible, in which case $d_j > 1$.

Now consider the probability of a particular at-risk spell ending at time $t_j$. The
probability that spell $j$ is the actual spell that ends equals the conditional probability
of failure for spell $j$ divided by the conditional probability that a spell of any individual
in the risk set $R(t_j)$ fails. This latter probability is the sum of the conditional probability
of failure for each individual in $R(t_j)$. Then

$$\begin{aligned}
\Pr[T_j = t_j | R(t_j)] &= \frac{\Pr[T_j = t_j | T_j \ge t_j]}{\sum_{l \in R(t_j)} \Pr[T_l = t_j | T_l \ge t_j]} \\
&= \frac{\lambda_j(t_j | \mathbf{x}_j, \boldsymbol{\beta})}{\sum_{l \in R(t_j)} \lambda_l(t_j | \mathbf{x}_l, \boldsymbol{\beta})} \\
&= \frac{\phi(\mathbf{x}_j, \boldsymbol{\beta})}{\sum_{l \in R(t_j)} \phi(\mathbf{x}_l, \boldsymbol{\beta})},
\end{aligned}$$

where in the last line the baseline hazard factor $\lambda_0(t_j)$ has dropped out, as a consequence
of the PH assumption. (As a result the intercept in this model is not identified.)
The preceding result that the baseline hazard can be eliminated provides a basis for
estimating $\boldsymbol{\beta}$. However, we must control for tied durations that are likely to occur when
durations are grouped.

Ties are more likely when durations are grouped. If the data include ties (i.e., there is
more than one failure at a given time), an adjustment is needed. For example, suppose
there are two tied values at time $t_j$, for individuals $j_1$ and $j_2$ with regressors $\mathbf{x}_{j_1}$ and
$\mathbf{x}_{j_2}$. If $j_1$ fails before $j_2$ then the probability is

$$\frac{\phi(\mathbf{x}_{j_1}, \boldsymbol{\beta})}{\sum_{l \in R(t_j)} \phi(\mathbf{x}_l, \boldsymbol{\beta})} \times \frac{\phi(\mathbf{x}_{j_2}, \boldsymbol{\beta})}{\sum_{l \in R_1(t_j)} \phi(\mathbf{x}_l, \boldsymbol{\beta})},$$

where $R_1(t_j)$ equals $R(t_j)$ with subject $j_1$ excluded. A similar term arises if $j_2$ fails
before $j_1$, and the likelihood contribution is the sum of these two possibilities. The
exact likelihood becomes quite complicated with many tied values. A standard
approximation, due to Breslow and Peto, see Cox and Oakes (1984), is to let

$$\Pr[T_j = t_j | j \in R(t_j)] \simeq \frac{\prod_{m \in D(t_j)} \phi(\mathbf{x}_m, \boldsymbol{\beta})}{\left[ \sum_{l \in R(t_j)} \phi(\mathbf{x}_l, \boldsymbol{\beta}) \right]^{d_j}}, \tag{17.31}$$

where $D(t_j)$ denotes the set of subjects that die at time $t_j$ and $d_j$ denotes the number
that die at time $t_j$. This approximation works well if the number of failures at time $t_j$
is small relative to the number at risk.

Cox defined the partial likelihood function to be the joint product of $\Pr[T_j =
t_j | j \in R(t_j)]$ over the $k$ ordered failure times. Then

$$L_p(\boldsymbol{\beta}) = \prod_{j=1}^{k} \frac{\prod_{m \in D(t_j)} \phi(\mathbf{x}_m, \boldsymbol{\beta})}{\left[ \sum_{l \in R(t_j)} \phi(\mathbf{x}_l, \boldsymbol{\beta}) \right]^{d_j}}. \tag{17.32}$$

Cox proposed estimation of $\boldsymbol{\beta}$ by maximizing the log partial likelihood function

$$\ln L_p = \sum_{j=1}^{k} \left[ \sum_{m \in D(t_j)} \ln \phi(\mathbf{x}_m, \boldsymbol{\beta}) - d_j \ln \left( \sum_{l \in R(t_j)} \phi(\mathbf{x}_l, \boldsymbol{\beta}) \right) \right]. \tag{17.33}$$

Censored spells appear only in the second term of $\ln L_p$, because they do not contribute
to observed deaths but, until they are censored, affect the size of the risk set.
Equation (17.33) can be rewritten as

$$\ln L_p(\boldsymbol{\beta}) = \sum_{i=1}^{N} \delta_i \left[ \ln \phi(\mathbf{x}_i, \boldsymbol{\beta}) - \ln \left( \sum_{l \in R(t_i)} \phi(\mathbf{x}_l, \boldsymbol{\beta}) \right) \right], \tag{17.34}$$

where the indicator variable $\delta_i$ equals one for an uncensored observation and zero
otherwise.
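For the usual specification of $\phi(\mathbf{x}, \boldsymbol{\beta}) = \exp(\mathbf{x}'\boldsymbol{\beta})$, so that $\ln \phi(\mathbf{x}, \boldsymbol{\beta}) = \mathbf{x}'\boldsymbol{\beta}$, the
resulting first-order conditions become

$$\frac{\partial \ln L_p(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \sum_{i=1}^{N} \delta_i \left[ \mathbf{x}_i - \mathbf{x}_i^*(\boldsymbol{\beta}) \right] = \mathbf{0},$$

where $\mathbf{x}_i^*(\boldsymbol{\beta}) = \sum_{l \in R(t_i)} \mathbf{x}_l \exp(\mathbf{x}_l'\boldsymbol{\beta}) / \sum_{l \in R(t_i)} \exp(\mathbf{x}_l'\boldsymbol{\beta})$ is a weighted average of the
regressors $\mathbf{x}_l$ for subjects at risk at failure time $t_i$.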
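
The log partial likelihood and its score are straightforward to compute. The following sketch handles a single scalar regressor with $\phi = \exp(x\beta)$, uses the Breslow-Peto treatment of ties, and maximizes by Newton's method. The tiny data set is invented purely for illustration, and the function names are our own; this is a sketch of the formulas above, not of any package's routine.

```python
import math

def cox_breslow(beta, times, events, x):
    """Log partial likelihood, score and information for scalar beta (Breslow ties)."""
    ll = score = info = 0.0
    for tj in sorted({t for t, e in zip(times, events) if e == 1}):
        risk = [i for i, t in enumerate(times) if t >= tj]            # R(t_j)
        dead = [i for i in risk if times[i] == tj and events[i] == 1]  # D(t_j)
        w = [math.exp(beta * x[i]) for i in risk]
        s0 = sum(w)
        s1 = sum(wi * x[i] for wi, i in zip(w, risk))
        s2 = sum(wi * x[i] ** 2 for wi, i in zip(w, risk))
        d = len(dead)
        ll += sum(beta * x[i] for i in dead) - d * math.log(s0)
        score += sum(x[i] for i in dead) - d * s1 / s0   # x less its weighted average
        info += d * (s2 / s0 - (s1 / s0) ** 2)
    return ll, score, info

def fit(times, events, x, iters=25):
    """Newton iterations on the log partial likelihood, starting from beta = 0."""
    beta = 0.0
    for _ in range(iters):
        _, g, h = cox_breslow(beta, times, events, x)
        beta += g / h
    return beta

# Hypothetical data: duration, failure indicator (0 = right-censored), binary regressor
times = [2, 3, 3, 5, 6, 8, 9, 12]
events = [1, 1, 0, 1, 1, 0, 1, 1]
x = [1, 0, 1, 1, 0, 0, 1, 0]

beta_hat = fit(times, events, x)
ll_hat, score_hat, info_hat = cox_breslow(beta_hat, times, events, x)
print("beta_hat:", beta_hat, " hazard ratio:", math.exp(beta_hat))
print("score at beta_hat:", score_hat, " standard error:", 1 / math.sqrt(info_hat))
```

Note that the censored spells enter only through the risk sets, and that no baseline hazard appears anywhere in the computation.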

The partial likelihood is a limited information likelihood, as the baseline hazard
$\lambda_0(t)$ has dropped out, but is neither a conditional likelihood nor a marginal likelihood.
Whether $L_p(\boldsymbol{\beta})$ is a valid likelihood function has given rise to much discussion in the
statistics literature. It can be shown (Andersen et al., 1993) that even though $\ln L_p$ is
not the full likelihood function, the estimator of $\boldsymbol{\beta}$ that maximizes $\ln L_p$ is consistent.
See also Kalbfleisch and Prentice (2002, pp. 99-101) and Lancaster (1990, chapter 9).

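The Chapter 5 results on extremum estimation apply, with the simplification that
$\mathbf{A}(\boldsymbol{\beta}) = -\mathbf{B}(\boldsymbol{\beta})$ similar to the ML case, so that

$$\hat{\boldsymbol{\beta}} \overset{a}{\sim} N\left[ \boldsymbol{\beta}, \left( -E\left[ \frac{\partial^2 \ln L_p}{\partial \boldsymbol{\beta} \partial \boldsymbol{\beta}'} \right] \right)^{-1} \right]. \tag{17.35}$$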




The estimator is inefficient, though comparisons of the partial likelihood estimator 
with the MLE for fully parametric PH models such as the Weibull reveal relatively 
small efficiency loss. 


17.8.3. Survivor Function for the Cox PH Model 


Many studies stop at estimation of $\boldsymbol{\beta}$, being content to measure the impact of changes
in regressors on the baseline hazard using (17.28) or (17.29). Other studies are additionally
interested in the shape of the baseline hazard function. For the PH model it is
possible to obtain a nonparametric estimate of the baseline hazard or survivor function,
once $\hat{\boldsymbol{\beta}}$ is obtained by maximizing the partial likelihood. The estimates are analogous
to the Kaplan–Meier estimator of Section 17.5.1.

We obtain the PH hazard function's associated survivor function

$$S(t|\mathbf{x}, \boldsymbol{\beta}) = S_0(t)^{\phi(\mathbf{x}, \boldsymbol{\beta})},$$

using $S(t|\mathbf{x}, \boldsymbol{\beta}) = \exp[-\int_0^t \lambda_0(s) \phi(\mathbf{x}, \boldsymbol{\beta}) \, ds]$ and defining $S_0(t) = \exp[-\int_0^t \lambda_0(s) \, ds]$.

Now assume a discrete time formulation with baseline hazard rate 1 — a ; at discrete 
failure time t;, j =1,...,k. Some considerable algebra given in the next section 
yields estimate @; that is the solution to 


eee E mÂ, j=1,...,k, (17.36) 
) 


leD(;) | — a meR(t; 


where B is the partial likelihood estimator of 3, D(t;) denotes the subjects that die at 
time ¢;, and R(t;) denotes the subjects at risk at time ¢;. From the discussion of dis- 
crete time hazard in Section 17.3.3, the baseline survivor function So(t) = [] jiya Ci 
the cumulative product of the instantaneous conditional survival probabilities. The es- 
timated baseline survival function is then 


Sot) = TT &. (17.37) 


jlt;st 


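If there are no regressors then $\hat{S}_0(t)$ reduces to the Kaplan-Meier estimator: normalize
$\phi(\mathbf{x}_l, \boldsymbol{\beta}) = 1$ and (17.36) yields hazard rate $1 - \hat{\alpha}_j = d_j/r_j$. If there
are regressors but no ties then (17.36) can be solved explicitly, yielding

$$1 - \hat{\alpha}_j^{\phi(\mathbf{x}_j, \hat{\boldsymbol{\beta}})} = \frac{\phi(\mathbf{x}_j, \hat{\boldsymbol{\beta}})}{\sum_{m \in R(t_j)} \phi(\mathbf{x}_m, \hat{\boldsymbol{\beta}})},$$

where $\mathbf{x}_j$ is the regressor vector of the subject that fails at $t_j$; the baseline hazard
rate is then $1 - \hat{\alpha}_j$.

The survivor function for individuals with regressors $\mathbf{x} = \mathbf{x}^*$ can be estimated using
$\hat{S}(t|\mathbf{x}^*, \hat{\boldsymbol{\beta}}) = \hat{S}_0(t)^{\phi(\mathbf{x}^*, \hat{\boldsymbol{\beta}})}$.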
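When several subjects fail at the same time, (17.36) has no closed form and must be solved numerically. The following is a minimal sketch, assuming the values $\phi(\mathbf{x}, \hat{\boldsymbol{\beta}})$ have already been computed; the function names and the bisection approach are illustrative choices, not taken from the text.

```python
def solve_alpha(phi_deaths, phi_risk_sum, max_iter=200):
    """Solve sum_l phi_l / (1 - alpha**phi_l) = phi_risk_sum for alpha in (0, 1).

    phi_deaths   : phi(x_l, beta_hat) for the subjects failing at t_j
    phi_risk_sum : sum of phi(x_m, beta_hat) over the risk set R(t_j)
    """
    def excess(alpha):
        return sum(p / (1.0 - alpha ** p) for p in phi_deaths) - phi_risk_sum

    lo, hi = 0.0, 1.0  # the left side of (17.36) increases with alpha on (0, 1)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if mid <= lo or mid >= hi:  # interval can no longer be split
            break
        if excess(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def baseline_survivor(alphas):
    """Cumulative product of the estimated alpha_j, as in (17.37)."""
    out, s = [], 1.0
    for a in alphas:
        s *= a
        out.append(s)
    return out


# No regressors: phi = 1 for everyone, d_j = 2 deaths, r_j = 10 at risk.
alpha_km = solve_alpha([1.0, 1.0], 10.0)
# One death with phi = 2 and a risk-set sum of 10 (no ties).
alpha_single = solve_alpha([2.0], 10.0)
surv = baseline_survivor([alpha_km, alpha_single])
```

In the first call the solution satisfies $1 - \hat{\alpha}_j = d_j/r_j = 0.2$, the Kaplan-Meier case. In the second, $\hat{\alpha}_j^2 = 0.8$, which matches the explicit no-ties solution.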


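Recentering the regressors (adding a constant to each) does not change the estimate of
$\boldsymbol{\beta}$, but it does change the baseline hazard function. For example,

$$\begin{aligned}
\lambda(t|\mathbf{x}, \boldsymbol{\beta}) &= \lambda_0(t) \exp(\mathbf{x}'\boldsymbol{\beta}) \\
&= \lambda_0(t) \exp(\bar{\mathbf{x}}'\boldsymbol{\beta}) \exp((\mathbf{x} - \bar{\mathbf{x}})'\boldsymbol{\beta}) \\
&= \lambda_0^*(t) \exp((\mathbf{x} - \bar{\mathbf{x}})'\boldsymbol{\beta}),
\end{aligned}$$

where the new baseline hazard is $\lambda_0^*(t) = \lambda_0(t) \exp(\bar{\mathbf{x}}'\boldsymbol{\beta})$. Hence subtracting the sample
mean from each regressor will change the baseline hazard, and care is needed in
interpretation of the baseline hazard or survivor function.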

Also, although the estimated baseline hazard is useful for computing and comparing 
hazard rates for specific groups of individuals, it may have a very choppy appearance, 
so some smoothing may be applied for ease of interpretation. 


17.8.4. Derivation for the Survivor Function 


We obtain the estimating equations for $\alpha_j$ given in (17.36), following Kalbfleisch and
Prentice (2002, pp. 114-118).

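A subject with duration time $t_j$ has likelihood contribution equal to the probability
of survival to time $t_j$ less the probability of survival to time $t_{j+1}$. This is

$$\begin{aligned}
S(t_j|\mathbf{x}, \boldsymbol{\beta}) - S(t_{j+1}|\mathbf{x}, \boldsymbol{\beta}) &= S_0(t_j)^{\phi(\mathbf{x}, \boldsymbol{\beta})} - S_0(t_{j+1})^{\phi(\mathbf{x}, \boldsymbol{\beta})} \\
&= \left(\alpha_j^{-1} S_0(t_{j+1})\right)^{\phi(\mathbf{x}, \boldsymbol{\beta})} - S_0(t_{j+1})^{\phi(\mathbf{x}, \boldsymbol{\beta})} \\
&= \left(\alpha_j^{-\phi(\mathbf{x}, \boldsymbol{\beta})} - 1\right) S_0(t_{j+1})^{\phi(\mathbf{x}, \boldsymbol{\beta})},
\end{aligned}$$

using $S_0(t_{j+1}) = \prod_{l=1}^{j} \alpha_l = \alpha_j S_0(t_j)$.

For those subjects that are censored at time $t_j$ the likelihood contribution is the
probability of survival beyond $t_j$, or $S_0(t_{j+1})^{\phi(\mathbf{x}, \boldsymbol{\beta})}$. So subjects that either die or are censored in
$[t_j, t_{j+1})$ contribute probability $S_0(t_{j+1})^{\phi(\mathbf{x}, \boldsymbol{\beta})} = \prod_{l=1}^{j} \alpha_l^{\phi(\mathbf{x}, \boldsymbol{\beta})}$, with an additional
multiplier $\left(\alpha_j^{-\phi(\mathbf{x}, \boldsymbol{\beta})} - 1\right)$ for subjects that die. Then over all failure times the likelihood
is

$$L(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \prod_{j=1}^{k} \left[ \prod_{l \in D(t_j)} \left(\alpha_j^{-\phi(\mathbf{x}_l, \boldsymbol{\beta})} - 1\right) \prod_{m \in R(t_j)} \alpha_j^{\phi(\mathbf{x}_m, \boldsymbol{\beta})} \right].$$

The log-likelihood is

$$\ln L(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \sum_{j=1}^{k} \left[ \sum_{l \in D(t_j)} \ln\left(\alpha_j^{-\phi(\mathbf{x}_l, \boldsymbol{\beta})} - 1\right) + \sum_{m \in R(t_j)} \phi(\mathbf{x}_m, \boldsymbol{\beta}) \ln \alpha_j \right].$$

Then $\partial \ln L(\boldsymbol{\alpha}, \boldsymbol{\beta})/\partial \alpha_j = 0$ can be re-expressed as (17.36).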
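The last step can be spelled out as follows. This is a sketch of the omitted algebra, writing $\phi_l$ as shorthand for $\phi(\mathbf{x}_l, \boldsymbol{\beta})$ (the shorthand is introduced here only for brevity): differentiate the log-likelihood with respect to $\alpha_j$, multiply through by $\alpha_j$, and simplify the ratio.

```latex
\begin{align*}
\frac{\partial \ln L}{\partial \alpha_j}
 &= \sum_{l \in D(t_j)} \frac{-\phi_l\,\alpha_j^{-\phi_l-1}}{\alpha_j^{-\phi_l}-1}
    + \sum_{m \in R(t_j)} \frac{\phi_m}{\alpha_j} = 0 \\
\Longrightarrow\quad
 \sum_{m \in R(t_j)} \phi_m
 &= \sum_{l \in D(t_j)} \frac{\phi_l\,\alpha_j^{-\phi_l}}{\alpha_j^{-\phi_l}-1}
  = \sum_{l \in D(t_j)} \frac{\phi_l}{1-\alpha_j^{\phi_l}} .
\end{align*}
```

Evaluating at $\boldsymbol{\beta} = \hat{\boldsymbol{\beta}}$ gives the estimating equation for $\hat{\alpha}_j$.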


17.9. Time-Varying Regressors 


The preceding results have been restricted to models where regressors are variables
such as gender that vary across individuals but for a given individual do not vary over
time. This is the usual setup in standard cross-section models such as logit and Tobit
models. For survival data, however, individuals may be observed at several stages
during a spell and relevant regressors may take different values over the spell. For
example, in a medical survival study dosage levels of a medication may vary over
time for a given individual. During an unemployment spell the rate of unemployment
benefits may change, perhaps in a discrete manner. During a job search the marital
status of a person may change.

Time-varying covariates pose two kinds of problems. First, it is clearly a misspec- 
ification to treat a time-varying covariate as a fixed variable. The entire history of 
the covariate over the spell may be relevant, a consideration that may require us to 
incorporate lagged values of some regressors as determinants of the hazard rate. Sec- 
ond, a time-varying covariate may exhibit feedback and hence may not be strictly 
exogenous as is often assumed in a duration model. For example, the duration of an 
unemployment spell may depend on the job search strategy of an individual, but the 
latter may change as the duration of unemployment lengthens. A second example is 
that the dosage level of the treatment may be varied in response to the deteriorating 
or improving condition of the patient. Deterministic time variation is easier to han- 
dle and hence standard analysis considers only the first of these issues, requiring the 
assumption that the covariates are weakly exogenous; that is, whatever the process, 
stochastic or deterministic, that underlies the time variation, we do not need to take 
account of the parameters of that process in estimating the hazard model under
consideration. Some authors (e.g., Kalbfleisch and Prentice, 2002, pp. 196-200) refer to
such time variation as external. Endogenous time-varying covariates are then called 
internal. 

One rather simple solution, especially if the software cannot handle time-varying 
covariates, is to replace the time-varying covariate by its average value during the 
spell. Good software, however, allows greater flexibility. 

Consider an individual spell of (say) unemployment that lasts from the origin to
time $T$, at which time a transition to employment is observed. Let $0 < t_1 < t_2 < t_3 < T$,
where $t_1$, $t_2$, and $t_3$ are intermediate points within the spell. Suppose that there are
two covariates $x_1$ and $x_2(t)$ that are, respectively, time-invariant and time-varying. For
simplicity assume that $x_1$ is binary and $x_2$ takes the values $x_2(t_1)$, $x_2(t_2)$, and $x_2(t_3)$
in a step fashion in the intervals $[0, t_1)$, $[t_1, t_2)$, and $[t_2, T)$, respectively. Also assume
that the time-varying regressor is exogenous and/or that the pattern of time variation
is deterministic. Then for this particular spell the data can be written as a three-line
record, rather than a one-line record, as follows:


Observation   Duration   x1   x2(t)     Censoring Indicator

1             t1         1    x2(t1)    0
1             t2         1    x2(t2)    0
1             T          1    x2(T)     1


The interpretation of this information is that we can split the total observed duration
into three segments. During the first and the second segment the covariate values are
$(1, x_2(t_1))$ and $(1, x_2(t_2))$, respectively, and no transition is observed (hence the
censoring indicator is 0), and then in the third segment the covariate values are $(1, x_2(T))$ and
a transition is observed. This is akin to having three observations, in two of which the
duration is censored and in the third of which the duration is complete.

Suppose now that both the current and one lagged value of $x_2(t)$ are thought to
be appropriate covariates. That is, the hazard rate at a point in time may depend on
changes in a covariate earlier in the spell. Then the data array can be written as follows:


Observation   Duration   x1   x2(t)     x2(t-1)   Censoring Indicator

1             t1         1    x2(t1)    0         0
1             t2         1    x2(t2)    x2(t1)    0
1             T          1    x2(T)     x2(t2)    1


Here we have assumed that the value of $x_2(t)$ prior to the commencement of the
spell was zero. Notice that in both of these examples, the covariate $x_2(t)$ varies at
discrete points in time.
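The multiline records above can be generated mechanically from a single spell. The sketch below is one hypothetical way to do this; the function name `split_spell`, the field names, and the numerical values are illustrative assumptions, and the pre-spell value of the covariate is set to zero as in the text.

```python
def split_spell(subject_id, end_time, failed, x1, x2_path, x2_before=0.0):
    """Turn one spell into one record per interval on which x2 is constant.

    x2_path   : list of (interval_start, x2_value), sorted, first start = 0
    failed    : 1 if the spell ends in an observed transition, 0 if censored
    x2_before : assumed value of x2 before the spell began (used for the lag)
    """
    records = []
    lagged = x2_before
    for k, (start, value) in enumerate(x2_path):
        last = k == len(x2_path) - 1
        stop = end_time if last else x2_path[k + 1][0]
        records.append({
            "id": subject_id, "start": start, "stop": stop,
            "x1": x1, "x2": value, "x2_lag": lagged,
            # only the final segment can carry the observed transition
            "indicator": failed if last else 0,
        })
        lagged = value
    return records


# Hypothetical spell: T = 12, with x2 changing at t1 = 4 and t2 = 9.
records = split_spell(1, 12.0, 1, 1, [(0.0, 0.5), (4.0, 0.8), (9.0, 0.3)])
for r in records:
    print(r)
```

Each record carries its own start and stop time, so that estimation software that accepts interval-type input can treat the three lines as pieces of one spell rather than as three independent subjects.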

Although one could have multiline entries in a data set, in a large data set this 
is potentially tedious and confusing if the software ends up treating the entries as 
different observations. Fortunately, computer software can usually allow the user to 
identify a time-varying covariate as a part of the definition of the regression model. 
One can accommodate step functions or continuous functions in terms of the elapsed 
duration of the spell. 


17.9.1. Extended Cox Model 


The fixed regressor analysis of the Cox model in Section 17.8 is readily extended to 
time-varying regressors. 
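In general the hazard function depends on the complete time path of regressors $\mathbf{x}(t)$,
so that

$$\lambda(t|\mathbf{x}(t)) = \lim_{\Delta t \to 0} \frac{\Pr[t \le T < t + \Delta t \,|\, \mathbf{x}(t), T \ge t]}{\Delta t}.$$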


We consider the PH form 


$$\lambda(t|\mathbf{x}(t)) = \lambda_0(t, \alpha)\,\phi(\mathbf{x}(t), \boldsymbol{\beta}),$$


where the restriction is made that only the current value x(t) of the covariate matters, 
rather than the entire history of x(t). 

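It is clear from Section 17.8.2 on the Cox partial likelihood approach that what
matters at each failure time $t_j$ is the value of regressors $\mathbf{x}(t_j)$ for those observations in
the risk set $R(t_j)$. Thus for the $i$th subject $\mathbf{x}_i$ is replaced by $\mathbf{x}_i(t_j)$. The partial likelihood
has similar changes, and

$$\ln L_p = \sum_{j=1}^{k} \left[ \sum_{m \in D(t_j)} \ln \phi(\mathbf{x}_m(t_j), \boldsymbol{\beta}) - d_j \ln \left( \sum_{l \in R(t_j)} \phi(\mathbf{x}_l(t_j), \boldsymbol{\beta}) \right) \right].$$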


Note that the form of the data is more complicated now, as there may be multiple
observations for each subject. For example, suppose time is in discrete integer values,
there is only one regressor, and observation one has completed duration 25 and
regressor $x_1$, which takes value 50 in [0, 5], 100 in [6, 15], and then 200 in [16, 25]. Further,
suppose the first five ordered failure times are 3, 8, 13, 18, and 25. Then $x_1(t_1) = 50$,
$x_1(t_2) = 100$, $x_1(t_3) = 100$, $x_1(t_4) = 200$, and $x_1(t_5) = 200$.
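The lookup in this example is simple to automate. A minimal sketch, assuming the covariate path is stored as (first, last, value) triples; the function name `covariate_at` is hypothetical:

```python
def covariate_at(t, steps):
    """Value at time t of a step function given as (first, last, value) triples."""
    for first, last, value in steps:
        if first <= t <= last:
            return value
    raise ValueError(f"time {t} is outside the observed spell")


x1_steps = [(0, 5, 50), (6, 15, 100), (16, 25, 200)]
failure_times = [3, 8, 13, 18, 25]
x1_at_failures = [covariate_at(t, x1_steps) for t in failure_times]
print(x1_at_failures)  # [50, 100, 100, 200, 200]
```

These are the regressor values that enter the risk-set sums of the partial likelihood at each of the five failure times.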


17.10. Discrete-Time Proportional Hazards 


Grouped duration models are more appropriate when failure times are observed or 
recorded at aggregated time intervals like a week or a month. 

A simple method is to form a panel and estimate a stacked logit or probit model of
the probability of individual failure in each period, with a separate intercept for each period.
This is presented in Section 17.10.3. However, first we present the discrete-time variant 
of a continuous-time PH model, considered by several authors including Kalbfleisch 
and Prentice (1980), Fahrmeir and Tutz (1994), Kiefer (1988), and Meyer (1990). Our 
exposition follows Blake, Lunde, and Timmermann (1999). 


17.10.1. Discrete-Time Proportional Hazards 


For grouped data, with grouping points $t_a$, $a = 1, \ldots, A$, the discrete-time hazard
function is defined by

$$\lambda^d(t_a|\mathbf{x}) = \Pr[t_{a-1} \le T < t_a \,|\, T \ge t_{a-1}, \mathbf{x}(t_{a-1})], \quad a = 1, \ldots, A.$$

Time-varying regressors are permitted. The associated discrete-time survivor function
is

$$S^d(t_a|\mathbf{x}) = \Pr[T \ge t_{a-1}|\mathbf{x}] = \prod_{s=1}^{a-1} \left(1 - \lambda^d(t_s|\mathbf{x})\right).$$

We first obtain the general relationship between the discrete- and continuous-time
hazards. The discrete-time hazard is the probability of failure in $[t_{a-1}, t_a)$ divided by
the probability of surviving to at least time $t_{a-1}$, so can be rewritten as

$$\lambda^d(t_a|\mathbf{x}) = \frac{S(t_{a-1}|\mathbf{x}) - S(t_a|\mathbf{x})}{S(t_{a-1}|\mathbf{x})}, \qquad (17.38)$$

where $S(t|\mathbf{x})$ is the survivor function. In the continuous case
$S(t|\mathbf{x}) = \exp\left(-\int_0^t \lambda(s)\,ds\right)$, and after some algebra (17.38) becomes

$$\lambda^d(t_a|\mathbf{x}) = 1 - \exp\left(-\int_{t_{a-1}}^{t_a} \lambda(s)\,ds\right). \qquad (17.39)$$
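The algebra connecting (17.38) and (17.39) is short. As a sketch, substitute the continuous-time survivor function into the ratio and combine the two integrals:

```latex
\begin{align*}
\lambda^d(t_a|\mathbf{x})
 &= 1 - \frac{S(t_a|\mathbf{x})}{S(t_{a-1}|\mathbf{x})}
  = 1 - \frac{\exp\left(-\int_0^{t_a}\lambda(s)\,ds\right)}
             {\exp\left(-\int_0^{t_{a-1}}\lambda(s)\,ds\right)} \\
 &= 1 - \exp\left(-\int_{t_{a-1}}^{t_a}\lambda(s)\,ds\right).
\end{align*}
```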


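Now specialize to the discrete-time hazard associated with the continuous PH
model

$$\lambda(t) = \lambda_0(t) \exp\left(\mathbf{x}(t_{a-1})'\boldsymbol{\beta}\right),$$

for $t$ in $[t_{a-1}, t_a)$. Note that the regressors are constant within the interval but can vary
across intervals, and $\lambda_0(t)$ can vary within the interval. Then (17.39) becomes

$$\begin{aligned}
\lambda^d(t_a|\mathbf{x}) &= 1 - \exp\left(-\exp\left(\mathbf{x}(t_{a-1})'\boldsymbol{\beta}\right) \int_{t_{a-1}}^{t_a} \lambda_0(s)\,ds\right) \qquad (17.40) \\
&= 1 - \exp\left(-\lambda_{0a} \exp\left(\mathbf{x}(t_{a-1})'\boldsymbol{\beta}\right)\right) \\
&= 1 - \exp\left(-\exp\left(\ln \lambda_{0a} + \mathbf{x}(t_{a-1})'\boldsymbol{\beta}\right)\right),
\end{aligned}$$

where $\lambda_{0a} = \int_{t_{a-1}}^{t_a} \lambda_0(s)\,ds$. The associated discrete-time survivor function is

$$S^d(t_a|\mathbf{x}) = \prod_{s=1}^{a-1} \exp\left(-\exp\left(\ln \lambda_{0s} + \mathbf{x}(t_{s-1})'\boldsymbol{\beta}\right)\right). \qquad (17.41)$$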


The density for the $i$th subject is the product of the survivor function in each period
that the subject survives times the hazard at the time of failure. It follows from (17.40)
and (17.41) that the likelihood is

$$\begin{aligned}
L(\boldsymbol{\beta}, \lambda_{01}, \ldots, \lambda_{0A}) = \prod_{i=1}^{N} \Bigg[ &\prod_{s=1}^{a_i-1} \exp\left(-\exp\left(\ln \lambda_{0s} + \mathbf{x}_i(t_{s-1})'\boldsymbol{\beta}\right)\right) \qquad (17.42) \\
&\times \left(1 - \exp\left(-\exp\left(\ln \lambda_{0a_i} + \mathbf{x}_i(t_{a_i-1})'\boldsymbol{\beta}\right)\right)\right) \Bigg],
\end{aligned}$$

where censoring is ignored for simplicity and failure is assumed to occur at time $t_{a_i}$ for
the $i$th observation. At least one failure is assumed to occur in each interval $[t_{a-1}, t_a)$.

The MLE maximizes (17.42) with respect to $\boldsymbol{\beta}$ and $\lambda_{01}, \ldots, \lambda_{0A}$. In a special case
partial likelihood is asymptotically equivalent to the MLE, though in general they
differ. More parsimonious models place some structure on the $\lambda_{01}, \ldots, \lambda_{0A}$, such as a
polynomial in time. Even more structure is placed by a fully parametric model such as
the Weibull, which sets $\lambda_{0s} = \int_{t_{s-1}}^{t_s} \alpha u^{\alpha-1}\,du$.
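To make the structure of (17.42) concrete, the sketch below evaluates its logarithm for a single scalar regressor. The data layout (one list of regressor values per subject, with failure in the last listed interval) and all numerical values are hypothetical and chosen only for illustration.

```python
import math


def discrete_ph_hazard(log_lam0, xb):
    """Hazard (17.40): 1 - exp(-exp(ln lambda_0a + x'beta))."""
    return 1.0 - math.exp(-math.exp(log_lam0 + xb))


def log_likelihood(beta, log_lam0, spells):
    """Log of (17.42) for a scalar regressor, ignoring censoring.

    log_lam0 : [ln lambda_01, ..., ln lambda_0A]
    spells   : one list per subject; entry s is the regressor value that
               applies in interval s + 1, and the subject fails in the last one
    """
    total = 0.0
    for x_path in spells:
        a_i = len(x_path)
        for s in range(a_i - 1):  # intervals survived
            total += -math.exp(log_lam0[s] + x_path[s] * beta)
        h = discrete_ph_hazard(log_lam0[a_i - 1], x_path[-1] * beta)
        total += math.log(h)  # interval in which failure occurs
    return total


# Hypothetical inputs: three intervals, three subjects.
log_lam0 = [math.log(0.10), math.log(0.15), math.log(0.20)]
spells = [[0.0], [1.0, 1.0], [0.0, 1.0, 1.0]]
ll = log_likelihood(0.5, log_lam0, spells)
```

In practice this function would be passed to a numerical optimizer over $\boldsymbol{\beta}$ and the $\ln \lambda_{0a}$; parameterizing the baseline in logs keeps each $\lambda_{0a}$ positive without constraints.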


17.10.2. Han and Hausman Approach 


Han and Hausman (1990) suggested a flexible approach to recovering the baseline 
hazard that is relatively easy to implement and that predates the work of Blake et al. 
(1999) but has similarities with the work of Meyer (1990) and Sueyoshi (1992). It 
allows for considerable flexibility in the specification of the baseline hazard, $\lambda_0(t)$,
while maintaining a parametric form (e.g., $\exp(\mathbf{x}'\boldsymbol{\beta})$) for the function of covariates. It
also has the merit of explicitly dealing with discrete duration data and of providing 
a framework that can more easily accommodate features of discrete data such as tied 
observations and unobserved heterogeneity. Tied observations can be a major problem 
with discrete data; for example, with unemployment durations the termination of many 
unemployment spells is likely to coincide with the end of the period of unemployment 
benefits (usually 26 weeks in the United States). 

The starting point is the hazard rate for observation $i$, $\lambda_i(\tau)$, denoting the
conditional probability that a spell terminates in the interval $(\tau, \tau + \Delta)$, written in the PH
form:

$$\lambda_i(\tau) = \lambda_0(\tau) \exp\left(-\mathbf{x}_i'\boldsymbol{\beta}\right),$$

where $\lambda_0(\tau)$ denotes the baseline hazard. Then (as shown in (17.20)) taking logs after
integration and then rearranging yields

$$\Lambda_0(t) - \mathbf{x}_i'\boldsymbol{\beta} = \varepsilon_i, \qquad (17.43)$$

where $\Lambda_0(t) = \ln \int_0^t \lambda_0(\tau)\,d\tau$ denotes the log of the integrated baseline hazard, and
$\varepsilon_i = \ln \int_0^t \lambda_i(\tau)\,d\tau$. Then the probability is given by

$$\Pr[\text{failure in period } t] = \int_{\Lambda_0(t-1) - \mathbf{x}_i'\boldsymbol{\beta}}^{\Lambda_0(t) - \mathbf{x}_i'\boldsymbol{\beta}} f(\varepsilon)\,d\varepsilon.$$

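Let $y_{it} = 1$ if the $i$th person experiences failure in period $t$, and $y_{it} = 0$ otherwise.
Then the joint log-likelihood of $N$ observations is given by

$$\ln L(\boldsymbol{\beta}, \Lambda_0(1), \ldots, \Lambda_0(T)) = \sum_{i=1}^{N} \sum_{t=1}^{T} y_{it} \ln \left[ \int_{\Lambda_0(t-1) - \mathbf{x}_i'\boldsymbol{\beta}}^{\Lambda_0(t) - \mathbf{x}_i'\boldsymbol{\beta}} f(\varepsilon)\,d\varepsilon \right], \qquad (17.44)$$

and the baseline hazard parameters $(\Lambda_0(1), \ldots, \Lambda_0(T))$ are estimated along with $\boldsymbol{\beta}$
in a flexible manner (i.e., without imposing a specific functional form).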

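The integral in the log-likelihood is of course the difference in the cdf evaluated at
$\Lambda_0(t) - \mathbf{x}_i'\boldsymbol{\beta}$ and at $\Lambda_0(t-1) - \mathbf{x}_i'\boldsymbol{\beta}$. The precise form of this expression depends on the
functional form of the cdf. If the random $\varepsilon_i$ are assumed to be standard normal distributed, the
log-likelihood takes the ordered probit form; under the assumption of a logistic distribution
the log-likelihood takes the ordered logit form, whereas the extreme value distribution implied
by the PH model leads to an ordered complementary log-log form. To be specific, under normality
the integral in the $i$th term is of the form

$$\Pr[\Lambda_0(t) < \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i < \Lambda_0(t+1)] = \Phi\left(\Lambda_0(t+1) - \mathbf{x}_i'\boldsymbol{\beta}\right) - \Phi\left(\Lambda_0(t) - \mathbf{x}_i'\boldsymbol{\beta}\right).$$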


In contrast to the partial likelihood approach, which treats the baseline hazard as
a nuisance function and eliminates it, the approach of Han and Hausman (1990)
estimates all the unknown parameters simultaneously at a modest computational cost.
Their Monte Carlo results show that the method is flexible and can approximate an
arbitrary hazard function well, eliminating the need for strong functional form assumptions.
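Under normality the period probabilities are differences of standard normal cdfs, exactly as in an ordered probit. The sketch below computes them using the bounds in (17.44); the cutoff values are hypothetical, and the lower bound for the first period is taken to be minus infinity (the log of a zero integrated hazard), which is an assumption made here for illustration.

```python
import math


def normal_cdf(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def failure_probabilities(log_int_hazard, xb):
    """Pr[failure in period t] = Phi(L(t) - x'b) - Phi(L(t-1) - x'b), t = 1..T.

    log_int_hazard : increasing values L(1), ..., L(T); L(0) is -infinity
    Returns the T period probabilities plus Pr[survive beyond period T].
    """
    probs, lower = [], 0.0  # Phi(-infinity) = 0
    for cut in log_int_hazard:
        upper = normal_cdf(cut - xb)
        probs.append(upper - lower)
        lower = upper
    probs.append(1.0 - lower)
    return probs


probs = failure_probabilities([-1.0, -0.2, 0.5], xb=0.3)
```

The log-likelihood (17.44) then sums, over individuals, the log of the entry of `probs` for the period in which failure was observed.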


17.10.3. Discrete-Time Binary Choice 


An alternative approach for discrete duration data is to use a binary choice model for 
transitions, since in each discrete time interval two outcomes are possible — the spell 
either ends or it does not. 

A general formulation of a discrete-time transition model is

$$\Pr[t_{a-1} \le T < t_a \,|\, T \ge t_{a-1}, \mathbf{x}] = F\left(\lambda_a + \mathbf{x}(t_{a-1})'\boldsymbol{\beta}\right), \quad a = 1, \ldots, A. \qquad (17.45)$$

This specification restricts the coefficients of regressors to be constant over time,
whereas the intercept $\lambda_a$, $a = 1, \ldots, A$, can vary over time. The obvious choices of
the function $F$ are the standard normal cdf or the logistic cdf. Then the parameters
$\lambda_a$ and $\boldsymbol{\beta}$ can be estimated by a stacked logit or stacked probit model in which a
separate intercept is permitted for each duration interval. This method is very appealing
because of its simplicity.
The resulting likelihood function is

$$L(\boldsymbol{\beta}, \lambda_1, \ldots, \lambda_A) = \prod_{i=1}^{N} \left[ \prod_{s=1}^{a_i-1} \left(1 - F\left(\lambda_s + \mathbf{x}_i(t_{s-1})'\boldsymbol{\beta}\right)\right) \right] \times F\left(\lambda_{a_i} + \mathbf{x}_i(t_{a_i-1})'\boldsymbol{\beta}\right).$$

This is similar to (17.42), the likelihood for the discrete-time PH model, aside
from the choice of function $F$. The hazard (17.40) is the extreme value cdf
evaluated at $\ln \lambda_{0a} + \mathbf{x}(t_{a-1})'\boldsymbol{\beta}$, so (17.40) yields the complementary log-log binary
choice model (see Table 14.3) rather than the more commonly used logit or probit
model.
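The stacking can be sketched as follows: each spell is expanded into one row per interval at risk, with a binary outcome equal to one only in the interval of failure. The example is hypothetical, omits regressors to keep it short, and adds a censored spell (which simply never records an outcome of one), an extension beyond the likelihood displayed above.

```python
import math


def person_period(spells):
    """Expand (last_interval, failed) spells into (subject, interval, y) rows."""
    rows = []
    for i, (last, failed) in enumerate(spells):
        for a in range(1, last + 1):
            rows.append((i, a, 1 if (failed and a == last) else 0))
    return rows


def logit_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))


def cloglog_cdf(z):
    """Extreme value cdf, which gives the discrete-time PH hazard (17.40)."""
    return 1.0 - math.exp(-math.exp(z))


def stacked_loglik(rows, intercepts, F):
    """Binary-choice log-likelihood with one intercept per interval."""
    total = 0.0
    for _, a, y in rows:
        p = F(intercepts[a - 1])
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


spells = [(1, 1), (3, 1), (2, 0)]  # hypothetical; the third spell is censored
rows = person_period(spells)
intercepts = [-2.0, -1.5, -1.0]
ll_logit = stacked_loglik(rows, intercepts, logit_cdf)
ll_cloglog = stacked_loglik(rows, intercepts, cloglog_cdf)
```

Swapping `logit_cdf` for `cloglog_cdf` is the only change needed to move from the stacked logit to the discrete-time PH specification.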


17.11. Duration Example: Unemployment Duration 


The following empirical application uses the data of McCall (1996), generously
provided to us by the author Brian McCall. The data set is derived from the January
Current Population Survey's Displaced Workers Supplements (DWS) for the years
1986, 1988, 1990, and 1992. We refer to the duration measure (spell) in this example
as unemployment duration, though more accurately it represents joblessness
duration since the DWS does not provide information as to whether a person is looking for
a job or not.

For this application, information on the part-time or full-time status of the first 
postdisplacement job is required. To determine whether the first postdisplacement job 
was part-time or full-time, the following method is adopted. The first postdisplace- 
ment job is designated as part-time if a subject was still in that job at the time of the 
survey and if the subject was working less than 35 hours per week in that job in the 
previous week. 

Table 17.6 defines the key economic covariates used to explain joblessness duration. 
The number of covariates in the models estimated is quite large, but in the interest of 
brevity only a subset is listed. McCall (1996) provides a fuller description. 


Table 17.6. Unemployment Duration: Description of Variables

Variable Name   Variable Label                                        Mean
spell           periods jobless: two-week interval                    6.248
CENSOR1         1 if reemployed at full-time job                      0.321
CENSOR2         1 if reemployed at part-time job                      0.102
CENSOR3         1 if reemployed but left job: pt-ft status unknown    0.172
CENSOR4         1 if still jobless                                    0.375
UI              1 if filed UI claim                                   0.553
RR              eligible replacement rate                             0.454
DR              eligible disregard rate                               0.109
TENURE          tenure years in lost job                              4.114
LOGWAGE         log weekly earnings                                   5.693
[Figure: "Overall Survival Function Estimate"; horizontal axis: unemployment duration in 2-week intervals]

Figure 17.3: Unemployment duration: Kaplan-Meier estimate of survival function. U.S. data
from 1986-92 on 3343 spells, some incomplete.


Unemployment durations have been measured in two-week intervals. Four binary
variables (CENSOR1, CENSOR2, CENSOR3, and CENSOR4) have been introduced
to indicate the status of the first postdisplacement job. For the analysis in this chapter
we use CENSOR1. Thus a spell is complete if the person is re-employed at a full-time job.
Another indicator variable UI is used to denote whether the subject filed an
unemployment claim or not. Replacement rate, which is the weekly benefit amount divided by
the amount of weekly earnings in the lost job, is represented by the variable RR.
"Disregard" is defined to be the threshold amount up to which recipients of unemployment
insurance who accept part-time work can earn without any reduction in unemployment
benefits. Disregard rate is the disregard divided by weekly earnings in the lost job. It
is described by the variable DR in this example. The other variables
are self-explanatory.

We begin with a descriptive analysis of the duration data. The simplest first step is to 
plot the Kaplan—Meier survival curve, which is shown in Figure 17.3 by the dark line. 
The lighter lines around the estimated Kaplan—Meier survival curve represent 95% 
confidence intervals developed in Section 17.5.2. As expected, the estimated survival 
curve declines rapidly at first and then slowly. 

As we see from Table 17.7, after the first period the survival probability is 0.91,
indicating that roughly 9% of the sampled individuals have terminated their spell within
the first two weeks of the joblessness spell.

In Figure 17.4, we plot the survival function by UI, that is, by whether the subject 
claims unemployment insurance or not. Again, as one can expect, it shows that those 
who claim unemployment insurance are more likely to remain unemployed than those 
who do not claim unemployment insurance. 

The Nelson-Aalen cumulative hazard in Figure 17.5 shows little variation in the
hazard rate, which translates into an approximately linear cumulative hazard. If the crude
hazard rate varied a lot, then the cumulative hazard would appear nonlinear.
Table 17.7. Unemployment Duration: Kaplan-Meier Survival and Nelson-Aalen
Cumulated Hazard Functions

Time   Survivor Function   Cumulative Hazard
1      0.9121              0.0879
2      0.8541              0.1514
3      0.8103              0.2027
4      0.7864              0.2322
5      0.7376              0.2943
12     0.5974              0.5005
13     0.5680              0.5496
14     0.5270              0.6219
26     0.3651              0.9809
27     0.3098              1.1325
28     0.3098              1.1325
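The two columns of Table 17.7 are built from the same period-by-period ratios $d_j/r_j$: the Kaplan-Meier estimate multiplies $(1 - d_j/r_j)$ whereas the Nelson-Aalen estimate sums $d_j/r_j$. As a check, the sketch below backs out the implied ratios from the survivor column for the first five periods (the only consecutive ones reported) and accumulates them; agreement with the tabulated cumulative hazard is only up to rounding of the four-digit entries.

```python
survivor = [0.9121, 0.8541, 0.8103, 0.7864, 0.7376]       # Table 17.7, periods 1-5
cum_hazard_table = [0.0879, 0.1514, 0.2027, 0.2322, 0.2943]

hazards, cum_hazard = [], []
prev_s, running = 1.0, 0.0
for s in survivor:
    h = 1.0 - s / prev_s   # d_j / r_j implied by the Kaplan-Meier product
    running += h           # Nelson-Aalen sums the same ratios
    hazards.append(h)
    cum_hazard.append(running)
    prev_s = s

rows = zip(hazards, cum_hazard, cum_hazard_table)
for t, (h, c, c_tab) in enumerate(rows, start=1):
    print(t, round(h, 4), round(c, 4), c_tab)
```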


The cumulated hazard functions by UI recipiency, shown in Figure 17.6, exhibit 
the expected pattern: The hazard increases at a higher rate for those who do not claim 
unemployment insurance than it does for those who do. 

Next we consider four parametric regression models using the covariates UI, RR,
DR, and LOGWAGE and the interaction terms RRUI and DRUI. The four models are
exponential, Weibull, Gompertz, and Cox PH. Writing the hazard function as

$$\lambda(t|\mathbf{x}) = \lambda_0(t, \alpha)\,\phi(\mathbf{x}, \boldsymbol{\beta}) = \lambda_0(t, \alpha) \exp(\mathbf{x}'\boldsymbol{\beta}),$$

recall that the exponential hazard assumes $\lambda_0(t, \alpha) = \exp(a)$ for some constant $a$,
the Weibull model assumes $\lambda_0(t, \alpha) = \exp(a)\alpha t^{\alpha-1}$ (i.e., monotonic hazards),
the Gompertz model assumes $\lambda_0(t, \alpha) = \exp(a) \exp(\gamma t)$, and the Cox PH model has no
intercept and makes no assumption about the shape of the baseline hazard. Recall also that
the formulation here is of the proportional hazard type and can also be interpreted
either as a parametric regression model or as an AFT model. In this parameterization
of the likelihood function, the parameters $(\alpha, \boldsymbol{\beta})$ are estimated. These are given
in Table 17.8 with the associated t-statistics. We also list the negative of the
log-likelihood, but recall that for the Cox PH model it is the partial log-likelihood. Both
exponential and Gompertz models fit equally well. The Weibull model provides the
best fit. As we see from Table 17.8, the fitted Weibull model exhibits positive duration
dependence ($\hat{\alpha} = 1.129 > 1$); that is, the probability of the spell terminating increases
as the spell lengthens.

[Figure: "Survival Function Estimates by UI Status"; series: No UI (UI = 0) and Received UI (UI = 1); horizontal axis: unemployment duration in 2-week intervals, 0 to 30]

Figure 17.4: Unemployment duration: estimated survival functions by whether or not
subjects receive unemployment insurance. Same data as Figure 17.3.

[Figure: "Overall Cumulative Hazard Estimate"]

Figure 17.5: Unemployment duration: Nelson-Aalen estimate of cumulative hazard function.
Same data as Figure 17.3.

Figure 17.6: Unemployment duration: estimated cumulative hazard functions by whether
or not subjects receive unemployment insurance. Same data as Figure 17.3.

Table 17.8. Unemployment Duration: Estimated Parameters from Four Parametric
Models

              Exponential       Weibull           Gompertz          Cox PH
Var           coeff.    t       coeff.    t       coeff.    t       coeff.    t
RR             0.472   0.79      0.448   0.70      0.472   0.78      0.522   0.91
DR            -0.576  -0.75     -0.427  -0.53     -0.563  -0.74     -0.753  -1.04
UI            -1.425  -5.71     -1.496  -5.67     -1.428  -5.69     -1.317  -5.55
RRUI           0.966   0.92      1.105   1.57      0.969   1.58      0.882   1.52
DRUI          -0.199  -0.20     -0.299  -0.28     -0.211  -0.21     -0.095  -0.10
LOGWAGE        0.35    3.03      0.37    2.99      0.35    3.03      0.34    3.03
CONS          -4.079  -4.65     -4.358  -4.74     -4.097  -4.65      -       -
α                                1.129
-ln L         2700.7            2687.6            2700.6             -

For all the models considered, only UI and LOGWAGE are statistically significant; the
other covariates are not. The estimated coefficient of UI is negative for all models,
implying that the joblessness spell of those who claim unemployment insurance
terminates more slowly. There is little variation of the estimates of UI across different models:
This estimate in Weibull and Gompertz models is approximately 5% and 0.2% higher
in absolute value than that in the exponential model, whereas it is 8% lower in the Cox
PH model. Similarly, the estimate of the coefficient of LOGWAGE is positive for all
the models and exhibits very little variation across models.

Whereas in the econometric literature it is common to report the estimate of $(\alpha, \boldsymbol{\beta})$
coefficients of the hazard function in AFT metric, in the biostatistics literature a
different parameterization is often used based on the PH metric. Note that the hazard ratio
$\lambda(t|\mathbf{x})/\lambda_0(t, \alpha) = \phi(\mathbf{x}, \boldsymbol{\beta}) = \exp(\mathbf{x}'\boldsymbol{\beta})$. For a categorical 0/1 scalar variable $x$, the
impact of a change from 0 to 1 is given by $\exp(\beta) - 1$, which measures impact relative to
the baseline hazard. Numerous packages give the users an option to estimate the model
in either or both metrics. The relative merits of the two parameterizations are discussed
in Cleves, Gould, and Gutierrez (2002).

Consider the exponential specification in Table 17.9, where the coefficients are
exponentials of the corresponding ones in Table 17.8. Here UI has hazard ratio 0.241. This
means that belonging to the category of subjects that claims unemployment insurance
decreases the hazard by nearly 76% relative to the baseline hazard. Similarly, for Weibull,
Gompertz, and Cox PH models, the hazard decreases by about 78%, 76%, and 73%,
respectively.
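The conversion between the two tables is a one-line calculation. The sketch below applies $\exp(\beta)$ to the UI coefficients of Table 17.8, taking them as negative in every model as stated in the text, and compares the results with the hazard ratios reported in Table 17.9.

```python
import math

# UI coefficients from Table 17.8 and UI hazard ratios from Table 17.9.
ui_coef = {"Exponential": -1.425, "Weibull": -1.496,
           "Gompertz": -1.428, "Cox PH": -1.317}
ui_ratio_table = {"Exponential": 0.241, "Weibull": 0.224,
                  "Gompertz": 0.240, "Cox PH": 0.268}

ratios = {m: math.exp(b) for m, b in ui_coef.items()}
# Percentage change in the hazard when UI switches from 0 to 1.
pct_change = {m: 100.0 * (r - 1.0) for m, r in ratios.items()}

for m in ui_coef:
    print(m, round(ratios[m], 3), round(pct_change[m], 1))
```

The computed ratios agree with Table 17.9 to within rounding, and the implied reductions in the hazard are roughly 76%, 78%, 76%, and 73%.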

For this example, we have taken into account right-censoring and have ignored the
role of unobserved heterogeneity. Hence the results obtained from the three models are
qualitatively similar. However, the relatively few included variables with significant
coefficients probably indicate that large unexplained variation (perhaps caused by
unobserved heterogeneity) may be a serious problem. This issue is considered further
in the next chapter.

Table 17.9. Unemployment Duration: Estimated Hazard Ratios from Four Parametric
Models

              Exponential        Weibull            Gompertz           Cox PH
Var           exp(β)    t        exp(β)    t        exp(β)    t        exp(β)    t
RR            1.603    0.63      1.565    0.57      1.604    0.62      1.686    0.71
DR            0.562   -1.02      0.653   -0.66      0.570   -0.99      0.471   -1.55
UI            0.241  -12.65      0.224  -13.12      0.240  -12.65      0.268  -11.53
RRUI          2.626    1.01      2.760    0.99      2.635    1.01      2.416    1.01
DRUI          0.819   -0.22      0.742   -0.33      0.810   -0.23      0.909   -0.10
LOGWAGE       1.420    2.56      1.441    0.08      1.42     2.55      1.40     2.57
α                                1.129
-ln L         2700.7             2687.6             2700.6             -


17.12. Practical Considerations 


Most statistical packages offer a good selection of routines for parametric
survival analysis. Standard nonparametric Kaplan-Meier survival function estimates,
with or without confidence intervals, with both numeric and graphic output are widely
available. In some cases survival analysis modules are sufficiently detailed to warrant
a special manual. For example, Allison (1995) offers a practical guide to survival
analysis in the SAS system; Cleves et al. (2002) provide a tutorial-style guide to survival
analysis in STATA. Not only do these guides explain the mechanics of implementing
particular program commands, but in many cases they provide insightful expositions of
the subtleties arising from specific features of data, alternative parameterizations, and
interpretation of results. A convenient way to learn about duration data analysis is by
using the examples in econometrics or statistical packages such as LIMDEP, STATA,
SAS, or S-Plus. The program manuals are themselves excellent sources of information
for standard models.


17.13. Bibliographic Notes 


17.3-17.7 Kalbfleisch and Prentice (1980, 2002) is the classic statistical reference for survival
analysis, with emphasis on the Cox model. Other useful sources include Lawless
(1982) and Cox and Oakes (1984) and the considerable number of statistical texts
on survival analysis that now exist. For a Bayesian treatment see Ibrahim, Chen,
and Sinha (2001). Recent statistical work has increasingly emphasized the counting
process approach, detailed in Fleming and Harrington (1991) and Andersen et al.
(1993). These references are very challenging, especially the latter. Lancaster (1990)
provides a thorough treatment of survival analysis, though the presentation is quite
technical and the book is oriented more to the general topic of transitions and
material presented in the subsequent two chapters. For social scientists, Allison's (1984)
excellent exposition, like that of Lancaster, covers much more than single-spell
survival analysis. For practitioners in microeconomics the survey article by Kiefer
(1988) is a good start.

17.8 Lancaster (1990) provides a thorough discussion of the partial likelihood approach.

17.10 Meyer (1990), Han and Hausman (1990), and Blake et al. (1999) are helpful
references on discrete hazard models. These articles generally allow for unobserved
heterogeneity, a topic discussed in the next chapter.

17.11 Economics applications are cited in Kiefer (1988) and in Greene (2003). Good
examples of parametric reduced-form type duration analysis are given by Lancaster
(1979), Narendranathan, Nickell, and Stern (1985), Jaggia (1991c), and Gritz
(1993). More recently the emphasis has shifted to computationally more complex
structural duration models. Examples are found in Van den Berg (1990) and Ferrall
(1997). Most applications of duration analysis are reduced-form models. Economists
have proposed structural duration models; references include Lancaster (1990) and
Van den Berg (2001). Van den Berg also provides an interesting discussion of the
economic theoretical foundations of the PH model. Duration data can often be
analyzed using different concepts of waiting time. Tunali and Pritchett (1997) use three
alternatives: calendar-time, age, and duration.


Exercises 


17-1 (Adapted from Sapra, 1998) Show that the duration data model with Pareto
density of the first kind $f(t) = \alpha k^{\alpha}/t^{\alpha+1}$, $\alpha > 0$, $t \ge k > 0$, is an accelerated
failure time duration model but is not a proportional hazards model. [Hint: Show
that $\ln t$ can be expressed as a linear regression in $k = \exp(\mathbf{x}'\boldsymbol{\beta})$ with an additive
homoskedastic error.]

17-2 (Based on Lancaster, 1979). For each of the following situations develop an
appropriate expression for the joint likelihood of $N$ observations in terms of the
duration density $f(t|\mathbf{x}, \boldsymbol{\beta})$ and survivor function $S(t|\mathbf{x}, \boldsymbol{\beta})$.

(a) A sample of independent completed durations, $t_i$, $i = 1, \ldots, N$, is available.

(b) The sample is generated as follows. Initially, individuals are selected from a
pool of unemployed and interviewed. Subsequently, they are reinterviewed
after $h$ periods. Selected individuals have been unemployed for $t$ weeks
on selection. Between selection and interview some find jobs, and others
do not. For those who have jobs the time of termination of unemployment
spells is known.

(c) The situation is the same as in (b) except that it is not known when the
unemployment spell ended.

17-3 (a) Using a 50% random sample of the McCall data set estimate the Kaplan-Meier
nonparametric survival and integrated hazard function estimates by
type of censoring, that is, by whether transition is to full-time or part-time
employment. Does the survival function look significantly different for the
two groups?

(b) Ignoring the censoring variable for type of spell termination, estimate the
hazard model for unemployment duration under the following parametric
distributional assumptions: (i) exponential, (ii) Weibull, (iii) log-logistic, and (iv)
Cox PH. Use the same covariates as in this chapter.

(c) Compare models (i)-(iii) and discuss which one you think provides the best
fit to the data. What does each model imply regarding the duration
dependence (shape of the hazard function) of a spell of unemployment?
CHAPTER 18


Mixture Models and Unobserved 
Heterogeneity 


18.1. Introduction 


There is a large statistical and econometric literature concerning the topic of unob- 
served heterogeneity. Observed heterogeneity refers to interindividual differences that 
are measured by regressors, and unobserved heterogeneity refers to all other differ- 
ences. Both factors affect survival times. In the presence of unobserved heterogeneity 
even individuals with the same values of all covariates may have different hazards out 
of a given state. When unobserved heterogeneity is ignored, its impact is confounded 
with that of the baseline hazard. 

To motivate further study consider a well-known empirical example. The aggregate 
hazard rate out of unemployment is known to be a declining function of the length 
of unemployment spell. If all individuals were identical then this would imply nega- 
tive duration dependence, that is, a falling probability of escaping unemployment the 
longer an individual has remained unemployed. However, suppose that there are two 
types of individuals in the unemployed population, type F (fast), who have a constant 
hazard rate of 0.4, and type S (slow), whose constant hazard rate is 0.1. The population 
is a 50/50 mixture of the two types. Then for 100 type F people we observe 40 transi- 
tions in the first period, 24 transitions in the second period, and 14.4 in the third. For 
the type S, we observe 10, 9, and 8.1 transitions in the first, second, and third periods, 
respectively. Hence the aggregate proportion of transitions will be (40 + 10)/200 = 
0.25, (24 + 9)/150 = 0.22, and (14.4 + 8.1)/117 = 0.192. This shows that the de- 
clining aggregate hazard is a consequence of aggregation across heterogeneous groups, 
which themselves have constant but different hazard rates. Accurate statements about 
duration dependence require that models incorporate unobserved heterogeneity. 
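
The arithmetic of this example can be reproduced with a few lines of code. The following is a minimal sketch in Python; the function name `aggregate_hazards` and its arguments are ours, introduced purely for illustration, and the exit rates are treated as discrete per-period hazards exactly as in the example.

```python
def aggregate_hazards(hazards, sizes, periods):
    """Per-period aggregate exit rate for groups with constant discrete hazards."""
    at_risk = list(sizes)
    rates = []
    for _ in range(periods):
        # Expected number of exits from each group in this period.
        exits = [h * n for h, n in zip(hazards, at_risk)]
        # Aggregate hazard: total exits divided by total still at risk.
        rates.append(sum(exits) / sum(at_risk))
        at_risk = [n - e for n, e in zip(at_risk, exits)]
    return rates


# Type F (hazard 0.4) and type S (hazard 0.1), 100 individuals each.
rates = aggregate_hazards(hazards=[0.4, 0.1], sizes=[100, 100], periods=3)
print([round(r, 3) for r in rates])  # prints [0.25, 0.22, 0.192]
```

Each group's own exit rate never changes; the aggregate rate falls only because the surviving pool contains an ever larger share of type S individuals.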

In linear regression models there are no complications caused by unobserved het- 
erogeneity if the heterogeneity is independent of regressors. In that case the conditional 
mean is unchanged, the unobserved heterogeneity is absorbed into the error term, 
and there is no omitted variables bias. In contrast, unobserved heterogeneity usually 
causes problems in duration models. In the simplest models, such as the exponential
model, it is possible to specify multiplicative unobserved heterogeneity uncorrelated
with regressors that leaves the conditional mean duration unchanged. However, even
in this simple case the conditional hazard function does change, and it is the hazard 
that is modeled out of necessity, given the presence of censoring and given also, for ex- 
ample, the interest of policy makers in determining how exit rates from unemployment 
vary with length of unemployment spell. 

The role of unobserved heterogeneity lies at the heart of numerous empirical puz- 
zles and conundrums. Although our focus in this chapter is in the context of duration 
models, most of the issues are of more general relevance. The material and techniques 
used here are also relevant to all econometric models, since all econometric models 
omit some individual-specific unobservable variables from the model. Leading exam- 
ples in other chapters include random parameters logit (Section 15.7), sample selection 
(Section 16.4), finite mixture for counts (Section 20.4) and fixed and random effects 
models for panel data (Chapters 21-23). These factors go under the collective heading
of unobserved heterogeneity. In biostatistics the term frailty is also used. In actuar- 
ial studies (multiplicative) unobserved heterogeneity measures proportional increase 
or decrease in the hazard rate (“force of mortality”) operating on a given individual 
relative to that on an average individual. Individual-specific heterogeneity need not be 
time-invariant, but in cross section models it is convenient to assume it is. 

It is important to consider the consequences of such an unavoidable misspecifica- 
tion. From ordinary linear multiple regression analysis it is known that such an omis- 
sion in general can lead to an omitted variable bias. In duration models, which are 
nonlinear, the analysis of unobserved heterogeneity is more complex. Introduction of 
unobserved heterogeneity leads to an important class of models called mixture mod- 
els, this being one of the many names for this class. The subject matter of this chapter 
concerns both the techniques for generating and analyzing mixture models and the 
substantive consequences of omitted heterogeneity. 

Distinguishing between heterogeneity and true state dependence has been a
long-standing issue that can be traced back in history to discussions concerning true and
apparent contagion. Neyman has been credited with the early insight that longitudinal data
may be essential to make this distinction empirically possible. When, however, only
cross-section data are available, there is a tendency to rely heavily on strong parametric
assumptions. The emphasis in the recent literature has been on freeing empirical analysis
from such assumptions and on testing the validity of maintained model assumptions.

The first part of this chapter, Sections 18.2-18.4, deals with mixture models based
on continuous distribution of heterogeneity. Section 18.5 presents models based on 
discrete heterogeneity. Section 18.6 considers relationships among different duration 
concepts from flow and stock data. Tests of misspecification and neglected heterogene- 
ity are dealt with in Section 18.7. An empirical example in Section 18.8 illustrates 
several of the ideas developed in the chapter. 


18.2. Unobserved Heterogeneity and Dispersion 


In this section we focus on unobserved heterogeneity in the exponential and Weibull 
models. We consider a form of multiplicative unobserved heterogeneity that, after
being integrated out, leaves the conditional mean unchanged but does inflate the
conditional variance and, more importantly, changes the conditional hazard function. The
popular Weibull model with gamma distributed heterogeneity is also presented. 


18.2.1. Mixtures 


The simplest model to consider is the exponential duration model. In an exponential
regression without heterogeneity the distribution of complete spells, $t_i$, is specified
conditional on observable weakly exogenous covariates $x_i$. This is equivalent to
specifying the conditional mean function as nonstochastic: $E[T|x] = \exp(x'\beta)$. In mixture
models we instead specify the distribution of $(t_i|x_i, v_i)$, where the additional $v_i$ denotes
an unobserved heterogeneity term for observation $i$. Simply, individuals are assumed
to differ randomly in a manner not fully accounted for by the observed covariates. The
marginal distribution of $t_i$ is obtained by averaging with respect to $v_i$.

The precise functional form linking $t_i$ and $(x_i, v_i)$ must be specified. A commonly
used functional form is the exponential mean with a multiplicative error. For example,
consider the PH model with unobserved heterogeneity. From Section 17.8 we have the
proportional hazards model, (17.25) and (17.26), which can be extended to include a
multiplicative term $v$. That is,


$$\lambda(t|x, v) = \lambda_0(t)\exp(x'\beta)\,v, \qquad v > 0,$$

and hence we can obtain an expression for integrated baseline hazard as follows:

$$\begin{aligned}
\lambda_0(t) &= \lambda(t|x, v)\exp(-x'\beta)\,v^{-1}, \\
\int_0^t \lambda_0(u)\,du &= \exp(-x'\beta)\,v^{-1}\int_0^t \lambda(u|x, v)\,du, \\
\ln\int_0^t \lambda_0(u)\,du &= -x'\beta - \ln v + \varepsilon,
\end{aligned} \tag{18.1}$$

where $\varepsilon = \ln\int_0^t \lambda(u|x, v)\,du$, and $v$ is assumed to be independent of the regressors and
of censoring time. A common normalization restriction is $E[v] = 1$. When $v > 1$, the
hazard rate is greater than for the average subject; it is less than that for the average
subject if $v < 1$. The independence assumption is strong and not necessarily realistic.
The multiplicative heterogeneity assumption is also rather special, but it is
mathematically convenient and more attractive than an additive error, which could violate
nonnegativity of $t_i$. A standard approach involves postulating a distribution for $v_i$, and
then deriving the marginal distribution of $t_i$.

Multiplicative heterogeneity has two important and related consequences. Not
surprisingly, the variance of the mixture (conditional on the observable variables) exceeds
the variance of the parent distribution (conditional on both the observables and
heterogeneity). That is, the variance gets inflated. Consider the exponential mean case.
Replace $\mu_i = \exp(x_i'\beta)$ by

$$\begin{aligned}
\mu_i^{*} &= E[t_i|x_i, v_i] \\
&= \exp(x_i'\beta)\,v_i \\
&= \exp(x_i'\beta)\exp(\varepsilon_i) \\
&= \exp(\beta_0 + \varepsilon_i + x_{1i}'\beta_1),
\end{aligned} \tag{18.2}$$

where the unobserved heterogeneity term $v_i$ is redefined as $\exp(\varepsilon_i)$ in the third line,
and the term $x_i'\beta$ is broken into the intercept and slope terms in the last line. The
last line has an interpretation as a conditional mean with a randomly varying intercept
$(\beta_0 + \varepsilon_i)$. It is usually assumed that the $v_i$ are iid, possibly with a known parametric
distribution, and that they are independent of the $x_i$.

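Assume that $v_i$ is iid with $E[v_i] = 1$ and $V[v_i] = \sigma_v^2$. The assumption that $E[v_i] = 1$
permits identification of the intercept. For the exponential density, the moments of
$t_i$ can be derived as $E[t_i|x_i, v_i] = \mu_i v_i$ and $V[t_i|x_i, v_i] = \mu_i^2 v_i^2$, and using the
Section A.8 result on variance decomposition,

$$\begin{aligned}
V[t_i|x_i] &= V_v\left[E_{t|v,x}(t_i|v_i, x_i)\right] + E_v\left[V_{t|v,x}(t_i|v_i, x_i)\right] \\
&= \mu_i^2 V[v_i] + \mu_i^2\left(V[v_i] + 1\right) \\
&= \mu_i^2\left[1 + 2\sigma_v^2\right] \\
&> \mu_i^2.
\end{aligned} \tag{18.3}$$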


The unconditional variance is inflated by unobserved heterogeneity. 
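
The inflation factor in (18.3) can be checked by simulation. The sketch below is illustrative only: it assumes gamma-distributed heterogeneity with mean 1 and variance $\sigma_v^2$ (one convenient choice; the result itself needs only the first two moments of $v_i$), and the function name `simulate_moments` and the parameter values are ours.

```python
import random


def simulate_moments(mu, sigma2, n, seed=0):
    """Sample mean and variance of t, where t | v is exponential with mean mu * v."""
    rng = random.Random(seed)
    shape = 1.0 / sigma2  # gamma shape; scale 1/shape gives E[v] = 1, V[v] = sigma2
    draws = []
    for _ in range(n):
        v = rng.gammavariate(shape, 1.0 / shape)
        # expovariate takes the rate, i.e. the reciprocal of the mean mu * v.
        draws.append(rng.expovariate(1.0 / (mu * v)))
    mean = sum(draws) / n
    var = sum((t - mean) ** 2 for t in draws) / n
    return mean, var


mu, sigma2 = 2.0, 0.25
mean, var = simulate_moments(mu, sigma2, n=200_000, seed=12345)
print("sample mean:", mean, " mu:", mu)
print("sample variance:", var, " mu^2 (1 + 2 sigma2):", mu**2 * (1 + 2 * sigma2))
```

The sample mean should stay close to $\mu_i$, while the sample variance should be close to $\mu_i^2(1 + 2\sigma_v^2)$ rather than the value $\mu_i^2$ implied by the exponential model without heterogeneity.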


18.2.2. Choice of Heterogeneity Distribution 


Consider how the distribution of t is affected by heterogeneity. This requires us to 
look at the marginal distribution of t; by integrating out the heterogeneity term, v, 
from S(t|x,v). A parametric distribution of v is usually specified. What considerations 
apply to choosing this distribution? 

To respect the property $v_i > 0$, we may specify a distribution with support on the
positive line. Examples are gamma, inverse Gaussian, and log-normal.

The gamma density is

$$g(v; \delta, k) = \frac{\delta^{k} v^{k-1}\exp(-\delta v)}{\Gamma(k)}, \qquad v > 0, \tag{18.4}$$

which has $E[v] = k/\delta$ and $V[v] = k/\delta^2$. Normalization sets $k = \delta$, so that $E[v] = 1$ and
$V[v] = 1/\delta$. The gamma assumption is mathematically convenient. It is also employed
in a number of popular software packages for duration modeling.

The inverse-Gaussian density is

$$g(v; \delta, \theta) = \delta\pi^{-1/2}\exp\left(2\delta\theta^{1/2}\right) v^{-3/2}\exp\left(-\theta v - \delta^2/v\right), \qquad v > 0, \tag{18.5}$$

which has $E[v] = \delta\theta^{-1/2}$ and $V[v] = \delta\theta^{-3/2}/2$. Normalization $\theta = \delta^2$ yields $E[v] = 1$
and $V[v] = 1/(2\theta)$. Relative to the gamma the inverse-Gaussian distribution has more
tail probability.

These will not necessarily produce an analytically tractable marginal distribution of
t. As we will show, some combinations such as exponential and gamma, or Weibull 
and gamma, lead to closed-form marginals, whereas others do not. However, this con- 
sideration is one of mathematical and computational convenience only and hence is 
not necessarily compelling on its own. Unfortunately, one rarely has guidance from 
economic theory on this aspect of duration modeling. 

A second consideration is generality and flexibility. The gamma model is quite 
flexible and has many attractive properties. However, the inverse-Gaussian may better 
handle heavy-tailed distributions. Both of these are one-parameter families (after nor- 
malization). Hougaard (1986) introduced a more flexible two-parameter family that 
has gamma and inverse-Gaussian as special cases. Later in this chapter we consider a 
discrete (nonparametric) representation that also affords considerable flexibility. 
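
The normalizations stated for the gamma density (18.4) and the inverse-Gaussian density (18.5) can be verified numerically. The following sketch uses a simple midpoint rule; the function names, the integration range, and the parameter values ($\delta = 2$ for the gamma, and $\delta = 2$, $\theta = \delta^2 = 4$ for the inverse Gaussian) are illustrative assumptions of ours.

```python
import math


def gamma_pdf(v, delta, k):
    """Gamma density as in (18.4)."""
    return delta**k * v ** (k - 1) * math.exp(-delta * v) / math.gamma(k)


def inv_gauss_pdf(v, delta, theta):
    """Inverse-Gaussian density as in (18.5)."""
    const = delta / math.sqrt(math.pi) * math.exp(2.0 * delta * math.sqrt(theta))
    return const * v ** (-1.5) * math.exp(-theta * v - delta**2 / v)


def moments(pdf, upper=60.0, n=200_000):
    """Midpoint-rule approximation to total mass, mean, and variance on (0, upper)."""
    h = upper / n
    m0 = m1 = m2 = 0.0
    for i in range(n):
        v = (i + 0.5) * h
        p = pdf(v) * h
        m0 += p
        m1 += v * p
        m2 += v * v * p
    return m0, m1, m2 - m1**2


# Gamma with k = delta = 2: expect mean 1 and variance 1/delta = 0.5.
gamma_mass, gamma_mean, gamma_var = moments(lambda v: gamma_pdf(v, 2.0, 2.0))
# Inverse Gaussian with theta = delta^2 = 4: expect mean 1 and variance 1/(2 theta) = 0.125.
ig_mass, ig_mean, ig_var = moments(lambda v: inv_gauss_pdf(v, 2.0, 4.0))
print(gamma_mass, gamma_mean, gamma_var)
print(ig_mass, ig_mean, ig_var)
```

Both densities should integrate to one with unit mean, so that after normalization each family is indexed by a single variance parameter.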


18.2.3. Weibull-Gamma Mixture 


Next we consider the popular Weibull-gamma mixture, which can be specialized to 
the exponential-gamma case. This model is a leading special case of a mixed propor- 
tional hazards (MPH) model. The Weibull-gamma mixture is, of course, of indepen- 
dent interest because of its greater generality, and especially because it will be shown 
to encompass both increasing and decreasing hazards. 

The survivor function conditional on multiplicative v for the Weibull model is 


$$S(t|v) = \exp(-\mu t^{\alpha} v), \qquad \mu > 0,\ \alpha > 0, \tag{18.6}$$


where $\mu$ replaces $\gamma$ used in Chapter 17.

The unconditional survivor function is given by the average survivor function. Aver- 
aging across the heterogeneous population using the density of v, g(v), as the weight- 
ing function yields, 


$$S(t) = E_v\left[S(t|v)\right] = \int S(t|v)\,g(v)\,dv. \tag{18.7}$$


Different choices of g(v) lead to different mixtures. With appropriate changes in in- 
terpretation both continuous and discrete distributions are valid. The integral in (18.7) 
may not have an analytical solution. For example, if g(v) is the log-normal density the 
integral does not have an analytical solution but if it is a gamma distribution it does. 
For mathematical convenience we work with the gamma case in what follows. 

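Given gamma heterogeneity the unconditional survivor function is

$$\begin{aligned}
S(t) &= \int_0^\infty \exp(-\mu t^{\alpha} v)\,\frac{\delta^{k} v^{k-1}\exp(-\delta v)}{\Gamma(k)}\,dv \\
&= \frac{\delta^{k}}{\Gamma(k)}\int_0^\infty v^{k-1}\exp\left(-v(\mu t^{\alpha} + \delta)\right)dv.
\end{aligned} \tag{18.8}$$

To obtain the mixture density we solve the integral. Letting $\mu t^{\alpha} + \delta = B$, we get

$$S(t) = \frac{\delta^{k}}{\Gamma(k)}\int_0^\infty \frac{(vB)^{k-1}}{B^{k-1}}\exp(-vB)\,dv.$$

Define $y = vB$, so that $dv = B^{-1}dy$ and

$$\begin{aligned}
S(t) &= \frac{\delta^{k}}{\Gamma(k)B^{k}}\int_0^\infty y^{k-1}\exp(-y)\,dy \\
&= \frac{\delta^{k}\,\Gamma(k)}{\Gamma(k)\,(\mu t^{\alpha} + \delta)^{k}} \\
&= \delta^{k}(\mu t^{\alpha} + \delta)^{-k} \\
&= \left[1 + (\mu t^{\alpha}/\delta)\right]^{-k},
\end{aligned} \tag{18.9}$$

where the second line is obtained using the definition of $\Gamma(k)$ and substituting for $B$.
The unconditional duration density function is obtained by differentiating with
respect to $t$ and multiplying by $-1$, which yields

$$f(t) = \frac{k}{\delta}\,\mu\alpha t^{\alpha-1}\left[1 + (\mu t^{\alpha}/\delta)\right]^{-(k+1)}. \tag{18.10}$$

The unconditional hazard function $\lambda(t) = f(t)/S(t)$ is given by

$$\lambda(t) = \frac{k}{\delta}\,\mu\alpha t^{\alpha-1}\left[1 + (\mu t^{\alpha}/\delta)\right]^{-1}. \tag{18.11}$$

These general expressions can be specialized by setting the mean of $v$ at 1; that is,
set $k = \delta$, which normalizes $E[v] = 1$, and leads to the following expressions for the
Weibull-gamma mixture:

$$S(t) = \left[1 + (\mu t^{\alpha}/\delta)\right]^{-\delta}, \tag{18.12}$$

$$f(t) = \mu\alpha t^{\alpha-1}\left[1 + (\mu t^{\alpha}/\delta)\right]^{-(\delta+1)}, \tag{18.13}$$

$$\lambda(t) = \frac{f(t)}{S(t)} = \mu\alpha t^{\alpha-1}\left[1 + (\mu t^{\alpha}/\delta)\right]^{-1}, \tag{18.14}$$

which tends to the Weibull hazard as the variance $1/\delta$ goes to zero.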

The Weibull model permits either increasing or decreasing hazards but somewhat 
restrictively assumes conditionally monotonic hazards at the individual level. Yet this 
mixture distribution has been popular in the econometrics literature, mainly because 
of its convenient properties; see Lancaster (1979) and Narendranathan, Nickell, and 
Stern (1985). 
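
As a numerical check on the closed forms (18.12) and (18.14), the average survivor function in (18.7) can be integrated directly against the normalized gamma density and compared with (18.12). The sketch below is ours: the function names, the midpoint-rule settings, and the parameter values $\mu = 1$, $\alpha = 1.5$, $\delta = 2$ are illustrative assumptions.

```python
import math


def survivor_closed_form(t, mu, alpha, delta):
    """Weibull-gamma survivor function (18.12)."""
    return (1.0 + mu * t**alpha / delta) ** (-delta)


def hazard_closed_form(t, mu, alpha, delta):
    """Weibull-gamma hazard function (18.14)."""
    return mu * alpha * t ** (alpha - 1) / (1.0 + mu * t**alpha / delta)


def survivor_numeric(t, mu, alpha, delta, upper=80.0, n=200_000):
    """Midpoint-rule version of (18.7) with gamma heterogeneity, k = delta."""
    h = upper / n
    const = delta**delta / math.gamma(delta)
    total = 0.0
    for i in range(n):
        v = (i + 0.5) * h
        g = const * v ** (delta - 1) * math.exp(-delta * v)
        total += math.exp(-mu * t**alpha * v) * g * h
    return total


mu, alpha, delta = 1.0, 1.5, 2.0
closed = survivor_closed_form(1.0, mu, alpha, delta)
numeric = survivor_numeric(1.0, mu, alpha, delta)
print("S(1) closed form:", closed, " numerical:", numeric)

# With alpha > 1 each individual hazard rises, but the mixture hazard rises and then falls.
hazards = [hazard_closed_form(t, mu, alpha, delta) for t in (0.1, 1.0, 10.0)]
print("mixture hazard at t = 0.1, 1, 10:", hazards)
```

With these values the individual-level Weibull hazard is increasing in $t$, yet the mixture hazard is higher at $t = 1$ than at either $t = 0.1$ or $t = 10$, which illustrates how the mixture can produce a nonmonotonic aggregate hazard.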

To specialize the results to the exponential-gamma mixture set $\alpha = 1$.
This yields $S(t) = [1 + (\mu t/\delta)]^{-\delta}$, $f(t) = \mu[1 + (\mu t/\delta)]^{-(\delta+1)}$, and
$\lambda(t) = \mu[1 + (\mu t/\delta)]^{-1}$. The exponential-gamma mixture, also known as the Pareto distribution
of the second kind, has more mass in the tails relative to the exponential. The
difference between the two depends on the variance, $1/\delta$. The $r$th moment exists only
if $\delta > r$.


18.2.4. Interpreting the Mixture Hazard Function


An important issue in economic applications is whether positive or negative duration 
dependence is present in duration data. For example, does the probability of exiting
from unemployment increase (e.g., owing to the worker's reservation wage falling) or
decrease (e.g., owing to the worker being viewed as damaged goods) as the length
of the unemployment spell increases? In the iid case this can be easy to establish by
nonparametric estimation methods. With non-iid data, however, a decreasing hazard
in the raw data may be due to aggregating across different individuals, each of whom
has a different constant hazard rate, or to a decreasing hazard for each individual.
Distinguishing between the two can be difficult. 

Consider the problem of interpreting the hazard function in the presence of
unobserved heterogeneity in the exponential-gamma mixture. Notice that even if individual
hazard (i.e., hazard conditional on $v$) is constant at $\mu$, the average or aggregate hazard
$\lambda(t)$ is declining in $t$. This does not mean that there is negative duration dependence
in the individual hazard rate. Rather, it is the effect induced by aggregation across
individuals who differ randomly in their hazard rates. A similar erroneous interpretation
can occur in the Weibull-gamma case. In that case the actual slope of the hazard
function depends on $\alpha$, but the slope of the average or aggregate hazard function is affected
by the presence of heterogeneity. Thus the neglect of unobserved heterogeneity may 
lead to underestimation of the slope of the hazard function. This result seems fairly 
general (see Lancaster, 1990). Salant (1977) provided an early extensive discussion of 
this phenomenon. 

This result is the basis of the claim (see, for example, Lancaster, 1979; Heckman 
and Singer, 1984a) that the estimation of hazard function in the presence of neglected 
unobserved heterogeneity may lead to serious biases. Our discussion motivates tests 
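of unobserved heterogeneity in hazard models. Let us examine the argument in the
context of the Weibull mixture model for which $S(t) = \int \exp(-\mu t^{\alpha} v)\,g(v)\,dv$. The
aggregate hazard function is

$$\begin{aligned}
\lambda(t) &= -\frac{dS(t)/dt}{S(t)} \\
&= \alpha\mu t^{\alpha-1}\int \frac{v\exp(-\mu t^{\alpha} v)}{S(t)}\,g(v)\,dv \\
&= \alpha\mu t^{\alpha-1}E[v|T > t].
\end{aligned}$$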


Because $E[v|T > t]$ is the average of $v$ over those surviving at time $t$, it must decrease
with time as individuals with higher values of $v$ leave the state sooner than
individuals with low values of $v$. This changes the slope of the aggregate hazard function.
This phenomenon can also be thought of as a form of selectivity bias (Section 16.5).
Formally, the average of $v$ among survivors at time $t$ can be written as

$$E[v|T > t] = \int \frac{v\exp(-\mu t^{\alpha} v)}{S(t)}\,g(v)\,dv.$$

Therefore, for the Weibull mixture model

$$\begin{aligned}
\frac{\partial E[v|T > t]}{\partial t} &= -\alpha\mu t^{\alpha-1}\int \frac{v^2\exp(-\mu t^{\alpha} v)}{S(t)}\,g(v)\,dv
+ \alpha\mu t^{\alpha-1}\left[\int \frac{v\exp(-\mu t^{\alpha} v)}{S(t)}\,g(v)\,dv\right]^2 \\
&= -\alpha\mu t^{\alpha-1}\left\{E[v^2|T > t] - \left(E[v|T > t]\right)^2\right\} \\
&= -\alpha\mu t^{\alpha-1}V[v|T > t] \\
&< 0.
\end{aligned} \tag{18.15}$$


Hence, neglecting heterogeneity results in an estimated hazard rate that is falling faster 
or rising more slowly than the actual hazard rate. 
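
The selection effect behind (18.15) is easy to see numerically when the heterogeneity distribution is discrete. The sketch below is ours and purely illustrative: it takes a two-point distribution for $v$ with $E[v] = 1$ (support points 1.6 and 0.4, each with probability 0.5), sets $\mu = 0.25$ and $\alpha = 1$ so that the two individual hazards are 0.4 and 0.1, and replaces the integrals over $g(v)$ by sums over the support points.

```python
import math


def conditional_moments(t, mu, alpha, support, probs):
    """E[v | T > t] and V[v | T > t] for discrete heterogeneity in a Weibull mixture."""
    # Each support point is reweighted by its survival probability to time t.
    weights = [p * math.exp(-mu * t**alpha * v) for v, p in zip(support, probs)]
    s = sum(weights)  # aggregate survivor function S(t)
    mean = sum(w * v for w, v in zip(weights, support)) / s
    second = sum(w * v * v for w, v in zip(weights, support)) / s
    return mean, second - mean**2


def aggregate_hazard(t, mu, alpha, support, probs):
    """Aggregate hazard alpha * mu * t^(alpha - 1) * E[v | T > t]."""
    mean, _ = conditional_moments(t, mu, alpha, support, probs)
    return alpha * mu * t ** (alpha - 1) * mean


mu, alpha = 0.25, 1.0
support, probs = [1.6, 0.4], [0.5, 0.5]
t_grid = [0.0, 1.0, 2.0, 5.0, 10.0]
means = [conditional_moments(t, mu, alpha, support, probs)[0] for t in t_grid]
hazards = [aggregate_hazard(t, mu, alpha, support, probs) for t in t_grid]
for t, m, lam in zip(t_grid, means, hazards):
    print(f"t = {t:5.1f}   E[v|T>t] = {m:.4f}   aggregate hazard = {lam:.4f}")
```

The conditional mean of $v$ starts at 1 and falls toward the smaller support point as the high-$v$ individuals exit, so the aggregate hazard declines even though each individual hazard is constant.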

Another interesting comparison between models with and without heterogeneity is 
the proportional impact of a change in a covariate on the hazard rate. In the absence of 
heterogeneity 


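$$\ln\lambda(t|\mu) = \ln\left(\mu t^{\alpha-1}\right) + \ln\alpha,$$

and the proportional impact of a change in $x_j$ on the hazard is

$$\frac{\partial\ln\lambda(t|\mu)}{\partial x_j} = \beta_j,$$

which is a property of the proportional hazard model.

Allowing for unobserved heterogeneity

$$\begin{aligned}
\ln\lambda(t|\mu) &= \ln\left(\mu t^{\alpha-1}\right) + \ln\alpha + \ln E[v|T > t] \\
&= \ln\alpha + \ln\mu + (\alpha - 1)\ln t + \ln E[v|T > t],
\end{aligned}$$

whence, noting that $\ln\mu = x'\beta$ and $\partial E[v|T > t]/\partial x_j = -\mu t^{\alpha}V[v|T > t]\,\beta_j$, it
follows that for the Weibull mixture model

$$\frac{\partial\ln\lambda(t|\mu)}{\partial x_j} = \beta_j\left[1 - \frac{\mu t^{\alpha}V[v|T > t]}{E[v|T > t]}\right]. \tag{18.16}$$

The result shows that given heterogeneity the proportional impact of a change in $x_j$ is
smaller and depends on $t$ and is no longer of the proportional hazard type. Thus, the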
estimates derived from the model may be misleading even if the unobserved hetero- 
geneity term is uncorrelated with the included covariates. 
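
For the Weibull-gamma mixture with $k = \delta$ these expressions can be evaluated in closed form, which gives a useful worked case. The derivation below is a sketch that relies only on the gamma density (18.4) and the results (18.14) and (18.16): multiplying the gamma density by $\exp(-\mu t^{\alpha} v)$ shows that, among survivors at time $t$, $v$ is again gamma distributed, with shape $\delta$ and rate $\delta + \mu t^{\alpha}$.

```latex
\begin{align*}
g(v \mid T > t) &\propto v^{\delta-1}\exp\!\left(-(\delta + \mu t^{\alpha})\,v\right), \\
E[v \mid T > t] &= \frac{\delta}{\delta + \mu t^{\alpha}}, \qquad
V[v \mid T > t] = \frac{\delta}{(\delta + \mu t^{\alpha})^{2}}, \\
\lambda(t) &= \alpha\mu t^{\alpha-1}\,E[v \mid T > t]
            = \alpha\mu t^{\alpha-1}\left[1 + (\mu t^{\alpha}/\delta)\right]^{-1}, \\
\frac{\partial \ln \lambda(t \mid \mu)}{\partial x_j}
  &= \beta_j\left[1 - \frac{\mu t^{\alpha}}{\delta + \mu t^{\alpha}}\right]
   = \beta_j\,\frac{\delta}{\delta + \mu t^{\alpha}}.
\end{align*}
```

The third line reproduces (18.14). The last line shows that in this case the proportional effect of $x_j$ equals $\beta_j$ at $t = 0$ and shrinks toward zero as $t$ increases, and that it approaches $\beta_j$ at every $t$ as the heterogeneity variance $1/\delta$ goes to zero.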

Similar consequences of unobserved heterogeneity for models more general than 
the Weibull are discussed in Lancaster and Nickell (1980). 


18.3. Identification in Mixture Models 


Associated with mixture models is a general identification problem. This issue
concerns the logical possibility of decomposing the individual contributions to the average
survival probability of the baseline hazard, the unobserved heterogeneity, and the
covariates, given the observed data (t, x) pertaining to a single spell. More specifically, if
the PH model were not identified, then it would be logically impossible to separate the 
individual contributions of duration dependence and unobserved heterogeneity. As in 
most discussions of identification, some restrictions are placed on the formulation. In 
econometric literature the case of (mixed) proportional hazards has been investigated 
in detail. Heckman and Singer (1984b) and Elbers and Ridder (1982) have established 
the identification of the MPH model under certain conditions. Van den Berg (2001) 
provides an excellent discussion of these earlier proofs as well as later contributions. 

Discussions of identifiability of the MPH model begin with the average or aggre- 
gate survivor function 


$$\begin{aligned}
S(t|x) &= E_v\left[S(t|x, v)\right] \\
&= \int \exp\left(-v\,\phi(x)\,\Lambda_0(t)\right)g(v)\,dv,
\end{aligned} \tag{18.17}$$


which assumes proportionality of hazards as in (18.1), uses the PH formulation of
Section 17.8, but does not make parametric assumptions on $\lambda_0$, $\phi$, or $g$. Here
$\Lambda_0(t) = \int_0^t \lambda_0(s)\,ds$. The model is said to be nonparametrically identified if, given the data, the
functions $\lambda_0$, $g$, and $\phi$ are unique. We add the qualifier “nonparametrically” because
of the absence of functional form assumptions.

Variations in observed survival times are due to variations in the covariates x, in 
v, and in the duration dependence function (baseline hazard). Identifiability means 
a unique decomposition of the variation. A proof of identifiability must show that 
these separate contributions are in principle identifiable. Most of the available proofs 
use advanced mathematical tools to show that the likelihood function can be uniquely 
decomposed. Melino and Sueyoshi (1990) provide a simpler proof. 

The conditions required for nonparametric identification include the following:
(i) The heterogeneity term $v$ is assumed to be time-invariant and independently
distributed of $x$. (ii) $g(v)$ is nondegenerate and has finite mean (i.e., $E[v] < \infty$). (iii)
$\phi(x) > 0$ for all $x$. (iv) $\lambda_0(t)$ is continuous and positive on $[0, \infty)$. (v) Observed
explanatory variables $x$ are linearly independent and have sufficient variation. Different
proofs have some subtle variation on these conditions but we will not delve into these 
here. 

Whereas the issue of nonparametric identification involves considerable
mathematical subtleties, the problem is also relevant in the context of parametric models. If one
specifies parametric forms such as $\lambda_0(t|\alpha)$, $\phi(x|\beta)$, and $g(v|\gamma)$, then are these
functions unique given the data? The answer, unfortunately, may be “no” in many cases.
This means that one investigator may estimate a particular mixture model with no 
computational problems, and apparently “nice” results and meaningful coefficients. 
However, this representation may not be unique. Another investigator may produce 
equally nice results under different parametric assumptions and with different impli- 
cations. That is, the observed survivor function may be consistent with other choices 
of the baseline hazard and heterogeneity distributions (Lancaster, 1990, chapter 4). In 
the terminology of Section 2.2, different structural models, with substantively different 
policy implications, may have the same reduced form. This clearly poses a problem for
parametric applied work. One appealing solution is to choose flexible parametric forms
for hazard and heterogeneity, or else to take the semiparametric approach of partial 
likelihood analysis. The discussion of this issue continues in the next section. 


18.4. Specification of the Heterogeneity Distribution 


The sensitivity of coefficient estimates to alternative assumptions about the hetero- 
geneity has been extensively discussed in the literature. Two apparently contradictory 
positions may be discerned: 


1. Parametric specifications of unobserved heterogeneity are often somewhat arbitrary. 
They may seriously distort inferences about the hazard function. Hence a parametrically 
flexible or nonparametric specification is desirable. See Heckman and Singer (1984a). 


2. Parametric specifications of unobserved heterogeneity are relatively innocuous if the 
baseline hazard function is correctly specified. When the specification of the hazard 
function is in doubt and/or is incorrect, the estimates produced using different para- 
metric assumptions for heterogeneity may lead to different estimates of the marginal 
distribution of the data. See Manton, Stallard, and Vaupel (1986). 


The apparent contradiction between the two positions may be resolved as follows. 
The specification of the hazard function affects the first moment of the distribution 
of f(t), whereas that of heterogeneity affects its second moment, assuming that it is 
uncorrelated with the observed covariates. If the hazard function is correctly speci- 
fied, then the main impact of the heterogeneity distribution would be on the relative 
efficiency of the estimator. 


18.4.1. Discrete-Time PH with Gamma Heterogeneity 


The preceding considerations suggest that a proportional hazard formulation with an 
arbitrary hazard function makes an attractive model with which to combine a specific 
heterogeneity assumption. Han and Hausman (1990) and Meyer (1990) combine the 
gamma heterogeneity assumption with the discrete-time proportional hazard model 
developed in Section 17.10. They report that when the baseline hazard is not
parameterized, estimates show little sensitivity to alternative functional forms for $g(v)$.

For specificity reconsider (17.43) after including a heterogeneity term: 


$$\varepsilon_j = \ln\left(\int \lambda_0(u)\,du\right) - x'\beta - v_i,$$


which can be substituted into the expression for log-likelihood (17.44). The het- 
erogeneity term needs to be integrated out. Han and Hausman give a closed-form 
expression under the gamma heterogeneity assumption and report results that
indicate relatively minor sensitivity to parametric assumptions given their flexible hazard
specification.


18.4.2. Some Other Models for Heterogeneity


The preceding discussion emphasized the computational convenience of the
Weibull-gamma model, which has a closed form.

If the tail of the observed marginal distributions is thicker than is consistent with the 
gamma or log-normal, one may consider a member of the Mandelbrot stable family of 
distributions. Hougaard (1986) proposed a very general family that nests, for example, 
the gamma and inverse-Gaussian families (also see Jaggia, 1991b). A strictly stable 
distribution obeys the condition that the sum of p independent realizations should 
have the same distribution as a scale factor times the distribution. Hougaard (2000, 
appendix 3.3) provides a summary of its properties. 

Although a more highly parameterized heterogeneity distribution looks attractive 
because of its greater generality, it may lead to two kinds of problems. The first prob- 
lem is that the available data may not be sufficiently rich to allow us to identify or 
precisely estimate the parameters. Often this situation cannot be recognized without 
attempting estimation in the first place. 

The second problem is computational. If the mixture density does not have a closed 
form, it is then left in the form of an integral. The resulting likelihood function has 
terms that are also integrals. Estimation requires the use of computer-intensive nu- 
merical methods such as numerical or Monte Carlo integration that were discussed in 
Chapter 12. An example of a mixture model that requires such estimation techniques is 
the Weibull-log-normal mixture in which unobserved heterogeneity has a log-normal
distribution. Simulation-based estimation of heterogeneity models is discussed by 
Gouriéroux and Monfort (1991, 1996) and considered as an example in Section 12.2. 


18.5. Discrete Heterogeneity and Latent Class Analysis 


The preceding analysis assumed a continuous distribution of unobserved heterogeneity 
and concentrated on estimation of the parameters of that distribution. 

An alternative approach assumes that the sample of individuals is drawn from a pop- 
ulation that consists of a finite number of latent classes, say q, and that each element 
in the sample can be regarded as a draw from one of these q latent sub-populations 
or strata. This model is known variously as the finite mixture model, semiparametric 
heterogeneity model (Heckman and Singer, 1984a), and latent class model (Aitkin
and Rubin, 1985). Its attractive feature is that it leads to a flexible parametric distri- 
bution. In duration modeling the model has been analyzed, advocated, and applied by 
Heckman and Singer (1984a). 

Although these popular models are presented in the context of duration models, a 
general notation is used to emphasize the potential for application elsewhere; see, for 
example, Section 20.4. 


18.5.1. Finite Mixture Model 


Consider the following two-component finite mixture model. If the sample is a
probabilistic mixture from two subpopulations with pdf $f_1(t|\mu_1(x))$ and $f_2(t|\mu_2(x))$, then
$\pi f_1(\cdot) + (1 - \pi) f_2(\cdot)$, where $0 < \pi < 1$, defines a two-component finite mixture.
That is, observations are draws from $f_1$ and $f_2$, with probabilities $\pi$ and $1 - \pi$,
respectively. The parameters to be estimated are $(\pi, \mu_1, \mu_2)$. The parameter $\pi$ may be treated
as constant or may be further parameterized using, for example, the logit function.
Thus $\pi = \exp(\lambda)/[1 + \exp(\lambda)]$ and $\lambda$ in turn may be parameterized in terms of further
observable covariates. Thus we think of two types of individuals, those that come from
$f_1(\cdot)$ and those that come from $f_2(\cdot)$. Perhaps there may be an a priori case for thinking
along these lines, for example if there is some latent characteristic that partitions the 
sampled population in this way. An alternative interpretation is simply that the linear 
combination of densities makes a good approximation to the observed distribution of t. 
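
A two-component mixture of this kind is simple to evaluate. The sketch below is ours and illustrative only: it assumes exponential components indexed by constant rates `rate1` and `rate2` (rather than by conditional means $\mu_j(x)$), uses the logit form for $\pi$, and all function names are hypothetical.

```python
import math


def logit_share(lam):
    """Mixing probability pi = exp(lam) / [1 + exp(lam)]."""
    return math.exp(lam) / (1.0 + math.exp(lam))


def mixture_density(t, lam, rate1, rate2):
    """pi * f1(t) + (1 - pi) * f2(t) with exponential components."""
    pi = logit_share(lam)
    return pi * rate1 * math.exp(-rate1 * t) + (1.0 - pi) * rate2 * math.exp(-rate2 * t)


def mixture_survivor(t, lam, rate1, rate2):
    """pi * S1(t) + (1 - pi) * S2(t) with exponential components."""
    pi = logit_share(lam)
    return pi * math.exp(-rate1 * t) + (1.0 - pi) * math.exp(-rate2 * t)


def mixture_hazard(t, lam, rate1, rate2):
    """Hazard of the mixture, f(t) / S(t)."""
    return mixture_density(t, lam, rate1, rate2) / mixture_survivor(t, lam, rate1, rate2)


# lam = 0 gives pi = 0.5; component rates 0.4 and 0.1.
hazards = [mixture_hazard(t, 0.0, 0.4, 0.1) for t in (0.0, 2.0, 5.0, 10.0)]
print(hazards)
```

With equal shares and component hazards of 0.4 and 0.1, the mixture hazard starts at 0.25 and declines toward 0.1, the hazard of the slower component.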

Generalization to additive mixtures with three or more components is in principle 
straightforward but subject to potential problems of the identifiability of the compo- 
nents. This is discussed further later in the chapter. Therefore, it is very helpful in 
empirical application if the components have a natural interpretation. At the simplest 
level we think of each subpopulation as a “type,” but in many situations a more infor- 
mative interpretation may be possible (Lindsey, 1995). 

Another interpretation of the finite mixture model is in terms of a discrete represen- 
tation of population heterogeneity. Suppose the population consists of m homogeneous 
subpopulations, usually called components. A parametric model, such as the Weibull 
or exponential, is supposed to apply to each component. Assume that the $j$th
component is a fraction $\pi_j$ of the total population, $\sum_j \pi_j = 1$.

Formally, the problem is formulated as follows: In all previous examples the
distribution of the unobserved heterogeneity term has infinite points of support. If the
continuous mixing distribution $g(v_i)$ can be approximated by a discrete distribution,
denoted by $\pi_j$ $(j = 1, \ldots, m)$ with a finite number, $m$, of support points then the
marginal (mixture) distribution is

$$h(t_i|x_i, \pi_j, \beta) = \sum_{j=1}^{m} f(t_i|x_i, v_j, \beta)\,\pi_j, \tag{18.18}$$

where $v_j$ is an estimated support point and $\pi_j$ is the associated probability. This
semiparametric representation of unobserved heterogeneity was examined by Heckman
and Singer (1984a) in duration modeling. Closely related work is that of Wedel et al.
(1993), where the latent class interpretation is favored. If the mixing distribution $\pi_j$ is
not subject to any parametric assumptions, then the mixture model is called a
semiparametric mixture model for $t$.

The estimation of the finite mixture model may be carried out under the assumption 
of either known or unknown number of components. If the fractions x; are known, 
maximum likelihood estimates of the component distributions can be estimated. More 
usually the proportions z;, j = 1,...,m, are unknown and the estimation involves 
both the x; and the component parameters. The maximum likelihood estimator for the 
latter case is called nonparametric maximum likelihood estimator (NPMLE). Here the 
nonparametric component is the number of classes, but it is strictly a semiparametric 
method because it is combined with parametric models for the components. If the 
number of components is unknown, as is usually the case, then some delicate issues of 
inference arise. See Section 18.5.4 for details. 

An obvious motivation for the finite mixture class is that it is a natural and simple
way to treat population heterogeneity. In many situations it is simpler to think of
unobserved heterogeneity in terms of a small number of latent classes rather than a
continuum of “types” as in Section 18.2.


18.5.2. Latent Class Interpretation 


The finite mixture model is related to latent class analysis (Aitkin and Rubin, 1985;
Wedel et al., 1993). Let $d_i = (d_{i1}, \ldots, d_{im})$ define an indicator (dummy) variable such
that $d_{ij} = 1$ ($\sum_j d_{ij} = 1$) indicates that $t_i$ was drawn from the jth (latent) group or
class for $i = 1, \ldots, N$. That is, each observation may be regarded as a sample from
one of the $m$ latent subpopulations, classes, or “types.” In the discussion that follows
we assume that the model is identified.

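The model specifies that $(t_i \mid d_i, \mu, \pi)$ are independently distributed with densities

$$\sum_{j=1}^{m} d_{ij} f_j(t_i \mid \mu_j) = \prod_{j=1}^{m} f_j(t_i \mid \mu_j)^{d_{ij}}, \tag{18.19}$$

where $\mu_j = (x_i, \beta_j)$, $\mu = (\mu_1, \ldots, \mu_m)$, and $(d_i \mid \mu, \pi)$ are iid with multinomial
distribution

$$\prod_{j=1}^{m} \pi_j^{d_{ij}}, \qquad 0 < \pi_j < 1, \quad \sum_{j=1}^{m} \pi_j = 1. \tag{18.20}$$

The last two relations imply that, once the unobserved $d_i$ are integrated out,

$$(t_i \mid \mu, \pi) \overset{iid}{\sim} \sum_{j=1}^{m} \pi_j f_j(t_i \mid \mu_j),$$

which leads to the likelihood function

$$L(\mu, \pi \mid t) = \prod_{i=1}^{N} \sum_{j=1}^{m} \pi_j f_j(t_i; \mu_j). \tag{18.21}$$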


18.5.3. EM Algorithm 


This likelihood function may be maximized directly or by applying the EM algorithm
in which the variables $d = (d_1, \ldots, d_N)$ are treated as missing data; see Section 10.3.
If the $d$ were observable the log-likelihood of the model would be

$$\ln L(\mu, \pi \mid t, d) = \sum_{i=1}^{N} \sum_{j=1}^{m} d_{ij} \ln f_j(t_i; \mu_j) + \sum_{i=1}^{N} \sum_{j=1}^{m} d_{ij} \ln \pi_j. \tag{18.22}$$

If $\pi_j$, $j = 1, \ldots, m$, are given, the posterior probability that observation $t_i$ belongs to
the population $j$, $j = 1, 2, \ldots, m$, denoted $z_{ij}$, is given by

$$z_{ij} = \Pr[t_i \in \text{population } j] = \frac{\pi_j f_j(t_i \mid x_i, \beta_j)}{\sum_{l=1}^{m} \pi_l f_l(t_i \mid x_i, \beta_l)}. \tag{18.23}$$

The average value of $z_{ij}$ over $i$ is the probability that a randomly chosen individual
belongs to subpopulation $j$. This equals $\pi_j$:

$$E[z_{ij}] = \pi_j.$$

Suppose we have available an estimate $\hat{z}_{ij}$ of $E[d_{ij}]$. Then, conditional on this estimate
we have

$$EL(\beta_1, \ldots, \beta_m, \pi \mid t, x_1, \ldots, x_N) = \sum_{i=1}^{N} \sum_{j=1}^{m} \hat{z}_{ij} \ln f_j(t_i \mid x_i, \beta_j) + \sum_{i=1}^{N} \sum_{j=1}^{m} \hat{z}_{ij} \ln \pi_j, \tag{18.24}$$


which constitutes the E-step of the EM algorithm. The M-step of the algorithm maximizes
$EL$ by solving the first-order conditions

$$\hat{\pi}_j - N^{-1} \sum_{i=1}^{N} \hat{z}_{ij} = 0, \qquad j = 1, \ldots, m, \tag{18.25}$$

$$\sum_{i=1}^{N} \sum_{j=1}^{m} \hat{z}_{ij} \frac{\partial \ln f_j(t_i \mid \beta_j)}{\partial \beta_j} = 0. \tag{18.26}$$

Next we can use (18.23) to get new values of $\hat{z}_{ij}$ and iterate through the E- and M-steps.
Once the process converges the variances can be computed using either the information
matrix or the robust formula.
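The E- and M-steps can be sketched in Python for the simplest case, an $m$-component exponential mixture without covariates, where the M-step for each rate has a closed form. This is a minimal illustration under stated assumptions, not the estimator used for the results in this chapter: the function names, starting values, iteration count, and simulated data are all hypothetical.

```python
import math
import random


def em_exponential_mixture(t, m=2, n_iter=200):
    """EM for an m-component exponential mixture without covariates."""
    n = len(t)
    mean_t = sum(t) / n
    pi = [1.0 / m] * m                             # starting class probabilities
    lam = [(j + 0.5) / mean_t for j in range(m)]   # distinct starting rates
    loglik_path = []
    for _ in range(n_iter):
        # E-step: posterior class probabilities z_ij as in (18.23).
        z = []
        loglik = 0.0
        for ti in t:
            w = [pi[j] * lam[j] * math.exp(-lam[j] * ti) for j in range(m)]
            total = sum(w)
            loglik += math.log(total)
            z.append([wj / total for wj in w])
        loglik_path.append(loglik)
        # M-step: (18.25) for pi_j; (18.26) has a closed form for exponentials.
        for j in range(m):
            zj = sum(zi[j] for zi in z)
            pi[j] = zj / n
            lam[j] = zj / sum(zi[j] * ti for zi, ti in zip(z, t))
    return pi, lam, loglik_path


def simulate_mixture(n, seed=12345):
    """Hypothetical data: 40% of spells have rate 0.2, 60% have rate 5.0."""
    rng = random.Random(seed)
    return [rng.expovariate(0.2) if rng.random() < 0.4 else rng.expovariate(5.0)
            for _ in range(n)]


if __name__ == "__main__":
    pi_hat, lam_hat, path = em_exponential_mixture(simulate_mixture(2000))
    print("pi:", pi_hat, "lambda:", lam_hat, "log-likelihood:", path[-1])
```

The recorded log-likelihood path should never decrease from one iteration to the next, which is a convenient check on any EM implementation.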


18.5.4. Choosing the Number of Latent Classes 


The first important issue concerns the choice of $m$, the number of components. Often
there is no guiding prior theory and the choice is usually made on pragmatic grounds.
Because the dimension of parameters to be estimated is $m \dim[\beta] + m - 1$, the number
of parameters can be quite large. This number can be decreased somewhat if some
elements of $\beta$ are restricted to be equal. One popular method involves allowing the
intercept to vary but restricting the slope parameters to be the same across groups (as in
(18.18)). However, there is clearly an incentive to keep $m$ small if all parameters are
allowed to vary across classes. Even when only the intercepts are allowed to vary, many
applications use $m = 2$. A sensible strategy is to start with $m = 2$, and then check the
fit of the model using diagnostic tests. An additional component is added if the fit is
poor. Adding components that cannot be reliably differentiated is problematic. When
intergroup differences are small, the finite mixture representation is not needed. The
most desirable situation is one in which the components have an interpretation. Choice
between models of different dimensions can be made using the penalized likelihood
criterion (AIC or BIC); see Section 8.5.1. The likelihood ratio test is not appropriate
because of the parameter boundary hypothesis problem. Baker and Melino (2000)
describe a Monte Carlo experiment that dramatically reveals the potential pitfalls of
overparameterization in a model in which both duration dependence and heterogeneity are
flexibly specified owing to a desire to avoid misspecification. For model selection they
recommend comparing a penalized likelihood criterion across candidate latent class
models, with a high penalty for more parameters.

When the model is overparameterized the parameters cannot be identified. The
problem may manifest itself by the presence of multiple optima or a flat likelihood 
surface. The computational algorithm may converge to different points depending on 
the starting values. 

A model selected from competing models using the penalized likelihood criterion 
may not necessarily describe the sample data well. This can only be ascertained by 
a suitable goodness-of-fit test and model diagnostics. Essentially one compares the 
actual and fitted distribution of durations; a significantly large deviation between the 
two indicates that the systematic component of the model does not adequately explain 
the observed sample variation. Some possibilities are considered in the next section. 
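The penalized likelihood comparison can be sketched as follows. The parameter count is the one given above for the case in which all coefficients vary by class, and the AIC and BIC take their usual forms, $-2\ln L + 2q$ and $-2\ln L + q\ln N$. The log-likelihood values, `dim_beta`, and sample size are hypothetical numbers chosen only for illustration.

```python
import math


def n_parameters(m, dim_beta):
    """Parameter count m * dim[beta] + (m - 1) when all coefficients vary by class."""
    return m * dim_beta + m - 1


def aic(loglik, q):
    """Akaike criterion: smaller is better."""
    return -2.0 * loglik + 2.0 * q


def bic(loglik, q, n_obs):
    """Bayesian (Schwarz) criterion: smaller is better."""
    return -2.0 * loglik + q * math.log(n_obs)


def select_m(logliks, dim_beta, n_obs):
    """Return the m minimizing BIC, plus the table of (m, q, AIC, BIC)."""
    table = []
    for m, ll in sorted(logliks.items()):
        q = n_parameters(m, dim_beta)
        table.append((m, q, aic(ll, q), bic(ll, q, n_obs)))
    best = min(table, key=lambda row: row[3])[0]
    return best, table


# Hypothetical maximized log-likelihoods for m = 1, 2, 3 (illustration only).
LOGLIKS = {1: -2700.0, 2: -2650.0, 3: -2648.5}

if __name__ == "__main__":
    best, table = select_m(LOGLIKS, dim_beta=7, n_obs=1000)
    for row in table:
        print(row)
    print("BIC selects m =", best)
```

In this made-up example the third component raises the log-likelihood by only 1.5 while adding eight parameters, so both criteria prefer $m = 2$.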


Computational Considerations 


A second issue concerns the choice of computer algorithm. Whereas the EM algorithm 
is very helpful in understanding the computational structure of the problem, in practice 
it often tends to be slow. The authors have found many instances in which the
Newton-Raphson algorithm based on numerical derivatives has produced satisfactory results.
See Haughton (1997) for a survey of alternatives. No matter which algorithm is used, if 
the intergroup differences are small, the likelihood surface will tend to exhibit several 
local maxima. In any case, a single unique maximum is not guaranteed. 

All finite mixture models are unidentified in the sense that the distribution of the 
data is unchanged if the subpopulation labels are permuted. That is, relabeling “com- 
ponent 1” as “component 2,” or vice versa, makes no difference. This problem can be 
dealt with by specifying either the $\pi_j$ or $\mu_j$ to be nondecreasing. It is desirable that the
component labels have some behavioral interpretation. 
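One simple way to impose such an ordering after estimation is to sort the components. The sketch below orders by class probability; the function name and the example values are hypothetical, and ordering by a component parameter instead would work the same way.

```python
def relabel_by_probability(pi, params):
    """Reorder components so that the class probabilities are nondecreasing."""
    order = sorted(range(len(pi)), key=lambda j: pi[j])
    return [pi[j] for j in order], [params[j] for j in order]


if __name__ == "__main__":
    # Hypothetical estimates: probabilities and exponential rates for two classes.
    print(relabel_by_probability([0.6, 0.4], [5.0, 0.2]))
```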

One potential limitation of the finite mixture model is that additional components 
may simply reflect the presence of outliers. Though this is not necessarily a bad thing, 
it is useful to be able to identify the outlying observations that are responsible for one 
or more components. Equation (18.23) can be useful in this regard. Postestimation one 
could calculate the posterior probability. For outliers these probabilities will be large 
with respect to one component and small with respect to the rest. 
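A minimal sketch of this postestimation check, assuming exponential components without covariates: it evaluates (18.23) for each duration and lists the observations that one component claims almost entirely. The parameter values, the durations, and the 0.99 cutoff are hypothetical.

```python
import math


def posterior_probabilities(t, pi, lam):
    """Posterior class probabilities (18.23) for one duration, exponential components."""
    w = [p * l * math.exp(-l * t) for p, l in zip(pi, lam)]
    total = sum(w)
    return [wj / total for wj in w]


def flag_dominated(durations, pi, lam, component, cutoff=0.99):
    """Indices of observations whose posterior for `component` exceeds the cutoff."""
    return [i for i, t in enumerate(durations)
            if posterior_probabilities(t, pi, lam)[component] > cutoff]


if __name__ == "__main__":
    # Hypothetical estimates and data, for illustration only.
    pi_hat, lam_hat = [0.4, 0.6], [0.2, 5.0]
    spells = [0.05, 0.3, 1.0, 30.0]
    print(flag_dominated(spells, pi_hat, lam_hat, component=0))
```

If a component's membership consists of only a handful of such flagged observations, that component may be capturing outliers rather than a distinct subpopulation.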


18.6. Stock and Flow Sampling 


In many practical situations the following question arises: What is the relationship 
between two or more different average duration measures that are available? From de- 
mography comes the well-known distinction between average age and expected life 
span. In real estate there is the distinction between the average period that a property 
offered for sale has remained unsold and the expected period before which a newly 
added property for sale will be sold. Often the first concept is used in popular discus- 
sions when the second may be more relevant. In economics there is a similar question 
about the relationship between different measures of unemployment duration that are 
published by government statistical agencies. The issue of unobserved heterogeneity, 
as it pertains to the pool of the unemployed, and to the flow into that pool, is closely 
involved in these discussions. One of the earlier influential discussions of these issues
was given in Salant (1977).

For specificity, let us focus on the familiar example of unemployment duration. 
One statistic that measures the unemployment experience of an already unemployed 
individual, published by statistical agencies in many countries, is the average
interrupted duration (AID), which is the average period for which members of the current
stock of unemployed have been unemployed. It is an estimate of the expected elapsed
duration. A second statistic is the period for which a newly unemployed individual can
expect to remain unemployed, often referred to as the average duration of a completed
spell of unemployment (ACD), a measure that features prominently in the job search
literature and is the one that the current and previous chapters have concentrated on.
This is an estimate of the expected length of a completed duration. We may think of AID
as a stock-based
measure and ACD as a flow-based measure; the former is analogous to average age 
in a population and the latter to the expected life span. The question of interest is the 
relationship between the two. 

The appropriate statistical tool for handling issues such as these is renewal theory. 
The stationary Poisson process with constant intensity parameter is an example of a 
renewal process. The number of renewals in a time interval dt refers to the num- 
ber of events. Duration is the time between successive occurrences of events (i.e., re- 
newals). For an individual in a given state the backward recurrence time refers to the 
elapsed duration since renewal, and forward recurrence time refers to the duration 
from current state to a transition. The expected number of events, denoted $E[N(t)]$,
in the time interval $(0, t]$ is called the renewal function and $\lim_{dt \to 0} dE[N(t)]/dt$ is
the renewal intensity, which determines the relationship between ACD and the average
backward recurrence time. In what follows, we concentrate on some well-known
results.

Salant (1977) showed that heterogeneity in hazard rates provides a key to under- 
standing the differences between AID and ACD. His diagrammatic representation 
provides intuition into the two key factors that affect the calculated averages. In Fig- 
ure 18.1 the vertical axis measures calendar time and the horizontal axis represents the 
date of the survey. Stock sampling refers to sampling in the survey period from the 
stock of individuals who are then in a given state. In contrast, flow sampling means 
that we sample those who enter the state during a particular interval. The lengths of 
spells in progress are shown as vertical lines. For illustration nine realizations of spells 
are shown and four of these (S6, S7, S8, and S9) are in progress on the survey date. 
Five spells (S1, S2, S3, S4, and S5) are completed during the 12-month survey period. 
If $u_j$ denotes the length of the jth in-progress spell sampled by the survey, then for
our example, AID $= \frac{1}{4}\sum_j u_j$. If $t_i$ denotes the length of the ith completed spell
sampled by the survey, then ACD $= \frac{1}{5}\sum_i t_i$.

Now observe that the survey is more likely to capture longer spells than shorter 
spells, and this leads to an upward bias that is the result of length-biased sampling. 
This type of bias is likely to lead to AID > ACD. However, because the survey mea- 
sures only incomplete durations, the average of such incomplete durations is likely 
to be shorter than the average of the completed durations. This is the phenomenon 
of interruption bias. The answer to the question of which source of bias dominates
depends on the distribution of spell lengths, and this in turn depends on the distribution
of hazard rates. Heterogeneous hazard rates provide a key to understanding the
relationship between the two.

Figure 18.1: Length-biased sampling under stock sampling: examples.

The key assumption is that of a stationary environment which refers to a situation 
in which inflows into and the outflow from the state are equal. Let f(u) denote the 
density of interrupted spells and $g(t)$ denote the density of completed spells. Then, the
distribution of $u$ is given by

$$f(u) = \frac{\bar{G}(u)}{\int_0^\infty \bar{G}(a)\, da} = \frac{\bar{G}(u)}{E[t]}, \tag{18.27}$$

where

$$\bar{G}(u) = \int_u^\infty g(a)\, da$$

is the survivor function corresponding to the density $g(u)$, and $E[t]$ is the mean of
the distribution of completed durations. For a full derivation of this result and the
underlying assumptions, see Salant (1977) or Lancaster (1990, Section 5.3).

An implication of this result is that if g(t) is exponential, so that the stochastic 
process for the event is the Poisson process, then f(u) is also exponential, and the 
mean of both duration measures is equal. 

Given (18.27), the general relationship between moments of the distributions of $u$
and $t$ can be derived. One useful result links the mean of $u$ to the mean and variance
of $t$:

$$E[u] = \frac{1}{2}\left(E[t] + \frac{V[t]}{E[t]}\right). \tag{18.28}$$


Another interesting result concerns the relationship between $E[t]$ and the mean
completed duration of the constant population with spells in progress (i.e., the average
across the stock of spells in progress). In line with intuition based on length-biased
sampling, the relation is

$$E[t^{s}] = E[t] + \frac{V[t]}{E[t]} > E[t], \tag{18.29}$$

which says that the mean duration for the constant stock, denoted $E[t^{s}]$, exceeds the
average expected duration of a new spell. If $g(t)$ is exponential, then $E[t^{s}] = 2E[t]$,
and $E[u] = \frac{1}{2}E[t^{s}]$; on average the sampled interrupted spell will be halfway to
completion.

What if the hazard rate is not constant? If the hazard rate is increasing in spell
length (i.e., positive state dependence) then $E[u] < E[t]$, and if it is decreasing (i.e.,
negative state dependence) then $E[u] > E[t]$.
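These relationships can be checked numerically for a Weibull distribution of completed spells. The sketch below assumes the survivor function $\exp(-\mu t^{\alpha})$, so that $E[t^k] = \mu^{-k/\alpha}\Gamma(1 + k/\alpha)$, and uses the mean of interrupted spells $\frac{1}{2}(E[t] + V[t]/E[t])$ and the stock mean $E[t] + V[t]/E[t]$. The function names and parameter values are hypothetical.

```python
import math


def weibull_mean_var(mu, alpha):
    """Mean and variance of t when the survivor function is exp(-mu * t**alpha)."""
    mean = mu ** (-1.0 / alpha) * math.gamma(1.0 + 1.0 / alpha)
    second = mu ** (-2.0 / alpha) * math.gamma(1.0 + 2.0 / alpha)
    return mean, second - mean ** 2


def stock_sampling_means(mu, alpha):
    """Return E[t], the mean interrupted spell E[u], and the mean over the stock."""
    mean, var = weibull_mean_var(mu, alpha)
    e_u = 0.5 * (mean + var / mean)     # mean elapsed duration in the stock
    e_stock = mean + var / mean         # mean completed duration in the stock
    return mean, e_u, e_stock


if __name__ == "__main__":
    for alpha in (0.5, 1.0, 2.0):
        print(alpha, stock_sampling_means(1.0, alpha))
```

With $\alpha = 1$ the interrupted and completed means coincide; with $\alpha > 1$ (rising hazard) the interrupted mean is smaller, and with $\alpha < 1$ (falling hazard) it is larger.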

Although these results have been obtained under the assumption of a constant pop- 
ulation, they have proved very useful in interpreting and clarifying the connections 
among various average measures of duration that are commonly employed. The results 
given here are valid regardless of the reason for spell occurrence. They also motivate 
a more careful investigation of the shape of the hazard function. 


18.7. Specification Testing 


Tests of specification in duration models take several different forms, including the 
following: 


• inclusion and exclusion tests for covariates,
• tests of functional forms of the survival function,
• tests of unobserved heterogeneity, and
• joint tests of state dependence and unobserved heterogeneity.


The first type of specification test does not raise new problems and can be handled 
by Wald-type tests. 

Tests of restrictions on functional form are the same as tests of unobserved
heterogeneity if the restriction is the absence of unobserved heterogeneity. Because the
latter can bias the estimation of the hazard rate, as shown in Section 18.2, diagnostic
testing for unobserved heterogeneity is desirable.

The standard formulation for this is to test whether the heterogeneity (variance)
parameter is zero. If this hypothesis is tested using the restricted model that assumes zero
heterogeneity, a score test is appropriate. The use of the likelihood ratio or Wald test
based on the unrestricted model will be problematic if the hypothesis is a boundary
hypothesis. For example, in the Weibull-gamma model (18.9), the restriction $1/\delta = 0$
will specialize the model to the Weibull, but this is a boundary hypothesis. The test
statistic then has a weighted chi-squared distribution under the null rather than the
standard one-degree-of-freedom chi-square distribution.


18.7.1. Hypothesis Tests 


One type of specification test is a score test of unobserved heterogeneity based on the 
exponential null model. Because of possible confounding between heterogeneity and 
duration dependence it is desirable to carry out a joint rather than a separate test. This 
can be done using the framework of a locally heterogeneous Weibull model (Lancaster,
1985).

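A locally heterogeneous density is generated by considering a Taylor expansion
around $\nu = 1$ of the Weibull survivor function with multiplicative heterogeneity $\nu$,
where $\nu$ has an arbitrary density, yielding

$$e^{-\nu\varepsilon} = e^{-\varepsilon} e^{-\varepsilon(\nu - 1)} = e^{-\varepsilon}\left[1 - \varepsilon(\nu - 1) + (\varepsilon^2/2)(\nu - 1)^2 + O\big((\nu - 1)^3\big)\right],$$

where $\varepsilon = \mu t^{\alpha}$. From the second line, taking expectations over $\nu$ with $E[\nu] = 1$ so
that the linear term vanishes,

$$E[e^{-\nu\varepsilon}] \approx e^{-\varepsilon}\left[1 + (\varepsilon^2\sigma^2/2)\right] = S_m(t),$$

where the term $\sigma^2$ is the variance of the heterogeneity distribution.
Then

$$\begin{aligned}
f_m(t) &= -\frac{\partial S_m(t)}{\partial t} \\
&= \alpha\mu t^{\alpha-1} e^{-\varepsilon}\left[1 + (\varepsilon^2\sigma^2/2)\right] - e^{-\varepsilon}\left[2\varepsilon(\alpha\mu t^{\alpha-1})\sigma^2/2\right] \\
&= \alpha\mu t^{\alpha-1} e^{-\varepsilon}\left[1 + \sigma^2(\varepsilon^2 - 2\varepsilon)/2\right].
\end{aligned}$$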

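Using the last result and allowing for censored observations, the log-likelihood is given
by

$$\begin{aligned}
\ln L(\alpha, \beta, \sigma^2) &= \sum_{i=1}^{N} \ln\left\{[f_m(t_i)]^{\delta_i} [S_m(t_i)]^{1-\delta_i}\right\} \\
&= \sum_{i=1}^{N} \Big\{\delta_i\left[\ln\alpha + (\alpha - 1)\ln t_i + \ln\mu_i + \ln\left(1 + \sigma^2(\varepsilon_i^2 - 2\varepsilon_i)/2\right)\right] - \varepsilon_i \\
&\qquad\quad + (1 - \delta_i)\ln\left(1 + \sigma^2\varepsilon_i^2/2\right)\Big\},
\end{aligned}$$

where $\delta_i$ is the censoring indicator, which takes the value one for uncensored durations
and zero otherwise, $\ln\mu_i = \beta_0 + x_i'\beta_1$, and $\varepsilon_i = \mu_i t_i^{\alpha}$ is the generalized error
(Section 18.7.2).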

The null hypothesis of interest is $H_0: \sigma^2 = 0$ and $\alpha = 1$. This is a joint test of
zero unobserved heterogeneity and the exponential distribution specification. Let
$\theta = (\theta_1', \theta_2')'$, $\theta_1 = (\sigma^2, \alpha)'$, and $\theta_2 = (\beta_0, \beta_1')'$, and let
$\theta_0 = (0, 1, \beta_0, \beta_1')'$ denote the restricted vector.

For simplicity consider only the case of uncensored data. Then the joint score test
statistic is

$$LM_{UD} = d\, s' \begin{bmatrix} \Psi'(1) & 1 \\ 1 & 1 \end{bmatrix} s, \tag{18.30}$$

where $s' = \left[\frac{1}{2}\sum_i (\varepsilon_i^2 - 2\varepsilon_i),\; \sum_i \left(1 + (1 - \varepsilon_i)\ln t_i\right)\right]$, $\Psi'(r)$ denotes the first
derivative of the digamma function $d\ln\Gamma(r)/dr$, and $d = 1/(N(\Psi'(1) - 1))$. To
implement the test, $LM_{UD}$ is evaluated at the null (i.e., replacing all quantities by their
estimates under the null of exponential distribution). This test statistic has an
asymptotic $\chi^2(2)$ distribution (Jaggia and Trivedi, 1994).

Notice that the matrix of the quadratic form in the $LM_{UD}$ statistic is not diagonal.
That is, the two components of the joint test are correlated. A separate test of
heterogeneity (duration dependence) has power against duration dependence
(heterogeneity). More explicitly, suppose we consider two separate score tests for
heterogeneity and duration dependence. They are

$$LM_{U} = \frac{1}{4N}\left(\sum_i (\varepsilon_i^2 - 2\varepsilon_i)\right)^2, \tag{18.31}$$

$$LM_{D} = \frac{1}{N\Psi'(1)}\left(\sum_i \left(1 + (1 - \varepsilon_i)\ln t_i\right)\right)^2, \tag{18.32}$$

each of which has a $\chi^2(1)$ distribution under the null. The separate test of zero
unobserved heterogeneity has power against the other null hypothesis because the tests are
correlated; see (18.30). Consequently, inferring the direction of misspecification on the
basis of a separate test can be misleading.
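The joint and separate score statistics can be sketched in Python for the simplest case of uncensored durations and an intercept-only exponential null. This sketch rests on several assumptions that should be checked against Jaggia and Trivedi (1994): the quadratic-form matrix is taken to have rows $(\Psi'(1), 1)$ and $(1, 1)$, the separate statistics are scaled by $1/N$ and $1/(N\Psi'(1))$, and $\Psi'(1) = \pi^2/6$. The function name and simulated data are hypothetical.

```python
import math
import random

TRIGAMMA_1 = math.pi ** 2 / 6.0   # Psi'(1), the trigamma function at one


def score_tests_exponential(t):
    """LM_UD, LM_U, LM_D for uncensored durations, intercept-only exponential null."""
    n = len(t)
    mu_hat = n / sum(t)                       # restricted MLE under the null
    eps = [mu_hat * ti for ti in t]           # generalized errors at alpha = 1
    s1 = 0.5 * sum(e * e - 2.0 * e for e in eps)
    s2 = sum(1.0 + (1.0 - e) * math.log(ti) for e, ti in zip(eps, t))
    d = 1.0 / (n * (TRIGAMMA_1 - 1.0))
    lm_ud = d * (TRIGAMMA_1 * s1 * s1 + 2.0 * s1 * s2 + s2 * s2)
    lm_u = s1 * s1 / n                        # equals (1/4N) (sum(eps^2 - 2 eps))^2
    lm_d = s2 * s2 / (n * TRIGAMMA_1)
    return lm_ud, lm_u, lm_d


if __name__ == "__main__":
    rng = random.Random(7)
    exponential_data = [rng.expovariate(0.1) for _ in range(500)]
    weibull_data = [rng.weibullvariate(1.0, 3.0) for _ in range(500)]  # shape 3
    print("exponential:", score_tests_exponential(exponential_data))
    print("Weibull:    ", score_tests_exponential(weibull_data))
```

Because the joint statistic is a quadratic form in the inverse information matrix, it can never be smaller than either separate statistic computed from the same scores.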

Because the specification of unobserved heterogeneity and state dependence are 
closely related, testing hypotheses about them separately can produce misleading re- 
sults (Jaggia and Trivedi, 1994). Formally speaking, tests of state dependence in the 
presence of incorrectly neglected heterogeneity are biased, and the reverse is also true. 
Jaggia (1991c) reanalyzes strike duration data that have been analyzed in a misleading 
manner in the econometrics literature. Jaggia and Trivedi (1994) develop some joint 
tests for a class of parametric models. See also Bera and Yoon (1993) who consider 
the more general issue of hypothesis testing when the model is misspecified. 

Useful as these tests are in simple parametric models, the starting point of an inves- 
tigation might be a Weibull, Weibull-gamma, or proportional hazard model. In such 
cases testing for unobserved heterogeneity, or any other specification error, can be car- 
ried out using the integrated hazard function because in the absence of heterogeneity 
integrated hazard is a unit exponential random variable. We now discuss some graphi- 
cal methods for evaluating the fit of the model based on integrated hazard. 


18.7.2. Graphical Tools for Detecting Misspecification 


In Section 8.7.2 we developed the concept of generalized residuals. In nonlinear mod- 
els a clear-cut choice of such a measure is difficult. In the present context there is a 
good choice. 


Generalized Residuals


A useful type of test is a nonparametric graphical test of fit of the duration model. 
The test uses the generalized residual, which is defined as a certain function of data 
and estimated parameters. For a correctly specified model the residuals should behave 
approximately like an iid sample from a known distribution. The integrated hazard 
turns out to have such a property and hence functions as an ingredient for a
residual-based specification test. In the context of duration models where from Section 17.3.1

$$S(t \mid \mu) = \exp[-\Lambda(t \mid \mu)],$$
$$f(t \mid \mu) = \lambda(t \mid \mu)\exp[-\Lambda(t \mid \mu)],$$

consider the distribution of the generalized residual

$$\varepsilon = \Lambda(t \mid \mu) = -\ln(S(t \mid \mu)). \tag{18.33}$$

The Jacobian of this transformation is

$$|J| = dt/d\varepsilon = \frac{1}{d\Lambda(t \mid \mu)/dt} = 1/\lambda(t \mid \mu).$$

Given $f(t \mid \mu)$, the transformation in (18.33), and the Jacobian of transformation, the
density of $\varepsilon$ is given by

$$f(\varepsilon) = \lambda(t \mid \mu)\exp(-\varepsilon)\,\frac{1}{\lambda(t \mid \mu)} = \exp(-\varepsilon), \tag{18.34}$$

which does not depend on $\mu$; the density is the unit exponential distribution. This result
was referred to in Sections 17.3.1 and 17.6.7.


Diagnostic Test Based on Integrated Hazard 


A diagnostic test can be constructed by exploiting the unit exponential property of the
generalized residual $\varepsilon$ under the null of correct specification. The survivor function
of the generalized residual is $S(\varepsilon) = \exp(-\varepsilon)$. Hence $-\ln S(\varepsilon) = \Lambda(\varepsilon) = \varepsilon$. For a
correctly specified model, a graphical comparison of the estimated integrated hazard
with $\hat\varepsilon$ should yield an approximately linear positive relationship with 45° slope. If the
plot deviates significantly from the 45° line a misspecification could be indicated.

For example, the estimated integrated hazard for the Weibull model is $\hat\varepsilon = \hat\mu t^{\hat\alpha}$.
Its survivor function is $\hat{S}(\hat\varepsilon) = N^{-1}$(number of sample observations $\geq \hat\varepsilon$).

A small formalization of this is to regress $-\ln \hat{S}(\hat\varepsilon)$ on $\hat\varepsilon$ and an intercept and test
whether the intercept is zero and the slope equals one.

The technique may be applied to any parametric model for which the integrated hazard
expression is available. For example, the generalized error for the Weibull-gamma
mixture (easily specialized to an exponential-gamma mixture by setting $\alpha = 1$)
is $\varepsilon = k\ln[(k + \mu t^{\alpha})/k]$. To apply the test, compute $\hat\varepsilon$ given estimates of $(\mu, \alpha, k)$,
and then plot $\hat\varepsilon$ against $-\ln \hat{S}(\hat\varepsilon)$.
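The regression version of this diagnostic can be sketched as follows for an intercept-only exponential model fitted to uncensored data. The function names and simulated data are hypothetical, ties among residuals are assumed away, and the sketch reports only the OLS intercept and slope; formal inference would need standard errors that account for the dependence among the ordered residuals.

```python
import math
import random


def generalized_residuals_exponential(t):
    """Integrated hazard mu_hat * t_i from an intercept-only exponential fit."""
    mu_hat = len(t) / sum(t)
    return [mu_hat * ti for ti in t]


def residual_plot_regression(eps):
    """OLS of -ln S_hat(eps) on eps, with S_hat(eps) = (number of obs >= eps) / N."""
    n = len(eps)
    x = sorted(eps)
    # The k-th smallest residual (k = 0, 1, ...) has n - k observations at or above it.
    y = [-math.log((n - k) / n) for k in range(n)]
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


if __name__ == "__main__":
    rng = random.Random(2024)
    durations = [rng.expovariate(0.1) for _ in range(2000)]
    eps = generalized_residuals_exponential(durations)
    print(residual_plot_regression(eps))
```

For correctly specified exponential data the intercept should be near zero and the slope near one; plotting the pairs `(x, y)` gives the graphical version of the same check.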


Censored Data 


In the case of censored observations the observed duration $t = \min[T, L]$, where $L$
denotes the right-censoring limit. If the observation exceeds $L$ it is censored at $L$. Then
the generalized error $\varepsilon(t)$ is not unit exponential distributed. The following derivation
leads to a relationship that suggests an adjustment for censoring:

$$\begin{aligned}
E[\varepsilon(T) \mid T > L] &= \frac{\int_{\varepsilon(L)}^{\infty} \varepsilon f(\varepsilon)\, d\varepsilon}{S(\varepsilon(L))} \\
&= \frac{1}{e^{-\varepsilon(L)}} \int_{\varepsilon(L)}^{\infty} \varepsilon e^{-\varepsilon}\, d\varepsilon \\
&= \frac{1}{e^{-\varepsilon(L)}} \left[-\varepsilon e^{-\varepsilon} - e^{-\varepsilon}\right]_{\varepsilon(L)}^{\infty} \\
&= 1 + \varepsilon(L),
\end{aligned} \tag{18.35}$$

upon integration by parts and simplification.

This suggests that one might use the generalized error $\hat\varepsilon(t)$ if the observation is not
censored, and $1 + \hat\varepsilon(L)$ if the observation is censored. Available
results suggest that this technique works reasonably well in the censored exponential
model when the proportion of censoring is not too heavy (Jaggia and Trivedi, 1994;
Jaggia, 1997).
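The adjustment can be sketched for an intercept-only exponential model with right censoring, where the maximum likelihood estimate of the rate is the number of completed spells divided by total observed time. The function name and the five data points are hypothetical.

```python
def adjusted_residuals_exponential(t, uncensored):
    """Censoring-adjusted generalized residuals for an intercept-only exponential fit.

    t          : observed durations min(T, L)
    uncensored : 1 if the spell is complete, 0 if right-censored
    """
    mu_hat = sum(uncensored) / sum(t)          # censored-data exponential MLE
    eps = [mu_hat * ti for ti in t]
    # Completed spells keep eps; censored spells get 1 + eps(L), as in (18.35).
    return [e if d == 1 else 1.0 + e for e, d in zip(eps, uncensored)]


if __name__ == "__main__":
    # Hypothetical data: three completed spells and two spells censored at L = 4.
    durations = [1.0, 2.5, 4.0, 4.0, 0.5]
    complete = [1, 1, 0, 0, 1]
    print(adjusted_residuals_exponential(durations, complete))
```

In this intercept-only case the adjusted residuals average exactly one by construction, because the fitted integrated hazards sum to the number of completed spells and each censored spell contributes an extra one.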


18.7.3. Conditional Moment Tests 


The conditional moment framework (see Section 8.2) applied to the generalized 
residuals provides a fruitful approach to specification testing. The idea can be illus- 
trated in the context of tests of unobserved heterogeneity. 

The integrated hazard function was shown previously to be distributed as a unit
exponential random variable with mean 1 and variance 1. In this case the conditional
second-moment restriction of interest is $E[(\varepsilon - 1)^2] = V[\varepsilon] = 1$, or equivalently

$$E[\varepsilon^2 - 2] = 0.$$


Higher order moment restrictions can also be generated and tested jointly or separately. 
For details see Jaggia (1991a). 
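As a sketch of how such higher order restrictions arise, recall that a unit exponential variable has raw moments $E[\varepsilon^k] = k!$, so each $k$ yields one restriction. The notation $m_k$ is introduced here for illustration only and is not taken from Jaggia (1991a).

```latex
m_k(\varepsilon) = \varepsilon^k - k!, \qquad E[m_k(\varepsilon)] = 0, \qquad k = 1, 2, 3, \ldots
% k = 1: E[\varepsilon - 1] = 0
% k = 2: E[\varepsilon^2 - 2] = 0   (the second-moment restriction above)
% k = 3: E[\varepsilon^3 - 6] = 0
```

In practice the sample analogues are computed from estimated residuals, so the test variance must allow for parameter estimation, as in the conditional moment framework of Section 8.2.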


18.8. Unobserved Heterogeneity Example: Unemployment Duration


In this section, we rework the empirical example of Section 17.11 under the assump- 
tion that unobserved heterogeneity is present and can be parameterized within an ana- 
lytically tractable parametric model. 
Figure 18.2: Unemployment duration: generalized residuals from the exponential model.
U.S. data from 1986-92 on 3343 spells, some incomplete. [Plot title: Exponential Model
Residuals; axis label: Generalized (Cox-Snell) Residual.]


As discussed in Section 18.7.2, we can use a graphical tool to examine the possible 
presence of unobserved heterogeneity by looking at the estimated fit of the model. For 
a correctly specified model, the residuals should follow the unit exponential
distribution. One can evaluate the model fit informally by computing and plotting the
empirical cumulated hazard function against the generalized residual. For a correctly
specified model the plot should exhibit an approximate straight line with slope
one.

Figures 18.2 and 18.3 show the generalized residual plots for the exponential model 
without and with (gamma) heterogeneity, respectively. As we can see from the two 
graphs, the fit of the model improves only marginally after we introduce unobserved 
heterogeneity. 


Figure 18.3: Unemployment duration: generalized residuals from the exponential-gamma
model. Same data as Figure 18.2. [Plot title: Exponential-Gamma Model Residuals.]
Table 18.1. Unemployment Duration: Exponential Model with Gamma and IG Heterogeneity

                Exponential-Gamma         Exponential-IG
Variable        Coeff.        t           Coeff.        t
RR               0.501      0.817          0.504      0.821
DR              -0.882     -1.118         -0.807     -1.032
UI              -1.585     -6.043         -1.545     -5.994
RRUI             1.091      1.725          1.057      1.686
DRUI             0.057      0.055         -0.013     -0.012
LNWAGE           0.379      3.184          0.373      3.156
CONS            -4.095     -4.507         -4.097     -4.545
σ²               0.232      3.178          0.207      2.925
-ln L          2695.35                   2696.48


This result can be verified by the actual estimates shown in Table 18.1, which 
also presents the estimates of the exponential model with inverse-Gaussian (IG) het- 
erogeneity. Although there is evidence of significant unobserved heterogeneity, the 
estimates of coefficients in these two settings do not differ much from what we 
have obtained earlier without the presence of unobserved heterogeneity. This is not
surprising: the presence of unobserved heterogeneity is expected to have a large impact
on the duration dependence parameter, and this factor is absent from the exponential
model.

However, a more interesting case arises when we consider a model with duration 
dependence and unobserved heterogeneity. Without presuming that it is the “correct” 
model, we consider the Weibull-inverse Gaussian mixture model. For ease
of comparison, we present these estimates in Table 18.2 along with the estimates when 
unobserved heterogeneity is neglected. 

The introduction of unobserved heterogeneity has a substantial impact on the duration
dependence parameter, which increases from 1.129 in Table 17.8 to 1.753 in
Table 18.2. The latter implies a more steeply rising hazard rate out of unemployment
than was the case when unobserved heterogeneity was ignored. Recall from Section
18.2.4 that one of the consequences of neglected heterogeneity in the proportional
hazards model is to underestimate the hazard rate; so the aforementioned empirical
finding is consistent with theory. Second, note that the evidence for unobserved
heterogeneity is very strong; the estimated variance parameter $\sigma^2$ has a t-ratio exceeding
11. Third, the fit of the model, as reflected in the log-likelihood, has also improved
(from -2687.6 to -2616.6). Although there is not much qualitative change in the
estimates of the coefficients, the effects of the significant coefficients (UI, LNWAGE,
and CONS) have become more pronounced after unobserved heterogeneity is
introduced.

The improvement in the fit of the model notwithstanding, the new mixture model
could still be misspecified. Once again we use the graphical device as an informal
specification test. Figures 18.4 and 18.5 plot the generalized residuals from the Weibull
model without and with unobserved heterogeneity, respectively. The plots suggest that
the mixture model, despite being more general than the exponential-IG model, appears to
be misspecified. To reiterate, although a simpler model that allows for neither
duration dependence nor unobserved heterogeneity shows little graphical evidence of
misspecification, an “improved” specification that generalizes the model in both
directions appears to be misspecified, a result similar to that of Jaggia (1991c). The
apparent puzzle may be resolved by the argument that the interaction between heterogeneity
and duration dependence accounts for the result. The Weibull model assumes monotonic
hazards. However, McCall (1996) provides evidence based on the same data that
a bathtub-shaped hazard function is more appropriate. He specifies a polynomial
baseline hazard function that is less restrictive than the monotonic function used here. Thus
a reasonable interpretation of our results is that a model that simultaneously allows
for both unobserved heterogeneity and duration dependence makes it easier to detect
misspecification than a model that ignores both.

Table 18.2. Unemployment Duration: Weibull Model with and without IG Heterogeneity

                Weibull-IG                Weibull
Variable        Coeff.        t           Coeff.        t
RR               0.736      0.812          0.448      0.70
DR              -1.073     -0.933         -0.427     -0.53
UI              -2.575     -6.698         -1.496     -5.67
RRUI             1.734      1.857          1.105      1.57
DRUI            -0.061     -0.039         -0.299     -0.28
LNWAGE           0.576      3.259          0.37       2.99
CONS            -5.303     -3.953         -4.358     -4.74
α                1.753     44.19           1.129     51.44
σ²               6.377     11.149          -          -
-ln L          2616.6                    2687.6

Figure 18.4: Unemployment duration: generalized residuals from the Weibull model. Same
data as Figure 18.2.

Figure 18.5: Unemployment duration: generalized residuals from the Weibull-inverse
Gaussian model. Same data as Figure 18.2. [Plot title: Weibull-IG Model Residuals.]

Finally, we implement a parametric test for the presence of unobserved heterogene- 
ity. The purpose is to illustrate some of the theory discussed in Section 18.7. The 
score test for neglected heterogeneity developed in Section 18.7.1 assumed uncensored 
data. Because the data used here include right-censored observations we implement the 
score test for the censored sample developed by Jaggia (1997). 

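We wish to test for zero unobserved heterogeneity, $H_0: \sigma^2 = 0$, in the exponential
duration model. Let the parameter set be denoted by $\theta = (\sigma^2, \beta)$ and let $s(\theta_0)$
and $\mathcal{I}(\theta_0)$ be, respectively, the score and the information matrix calculated under the
null. Using the log-likelihood derived in Section 18.7.1, we can write
$s(\theta_0) = (s_1(\theta_0), s_2(\theta_0))$, where
$s_1(\theta_0) = \partial \ln L/\partial\sigma^2 |_{\theta_0} = \frac{1}{2}\sum_i (\varepsilon_i^2 - 2\delta_i\varepsilon_i)$ and
$\mathcal{I}(\theta_0) = -E[\partial^2 \ln L/\partial\theta\,\partial\theta']|_{\theta_0}$. The
score test for unobserved heterogeneity is then given by

$$LM = s_1(\tilde\theta_0)\, \mathcal{I}^{11}(\tilde\theta_0)\, s_1(\tilde\theta_0) \overset{a}{\sim} \chi^2(1), \tag{18.36}$$

where $\mathcal{I}^{11} = [\mathcal{I}_{11} - \mathcal{I}_{12}(\mathcal{I}_{22})^{-1}\mathcal{I}_{21}]^{-1}$ is the first diagonal component of the
partitioned inverse of $\mathcal{I}(\theta)$, given in Jaggia (1997), and the tilde superscript is used to
denote restricted maximum likelihood estimates.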

For our sample, we found that LM = 44.25, which far exceeds the critical value
of $\chi^2(1)$, and hence we reject the null of $\sigma^2 = 0$. This result is consistent with that
from the Weibull-gamma and Weibull-IG models where a significant improvement
in the fit of the model resulted from introduction of unobserved heterogeneity. As
previously noted, this test also has power against misspecified duration
dependence.
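The size of this rejection can be checked with a few lines of standard-library Python. The sketch below is illustrative only: the function name is hypothetical, it uses the identity $\Pr[\chi^2(1) > x] = \mathrm{erfc}(\sqrt{x/2})$, and it takes 3.841 as the conventional 5% critical value of $\chi^2(1)$.

```python
import math


def chi2_1_pvalue(stat: float) -> float:
    """Upper-tail probability of a chi-square(1) variate.

    If Z ~ N(0, 1) then Z**2 ~ chi-square(1), so
    Pr[Z**2 > x] = Pr[|Z| > sqrt(x)] = erfc(sqrt(x / 2)).
    """
    if stat < 0:
        raise ValueError("statistic must be nonnegative")
    return math.erfc(math.sqrt(stat / 2.0))


LM = 44.25          # score statistic reported in the text
CRIT_5PCT = 3.841   # conventional 5% critical value of chi-square(1)

p_value = chi2_1_pvalue(LM)   # a very small number
reject = LM > CRIT_5PCT       # True: reject zero heterogeneity
```

Because the statistic is so far in the tail, the conclusion does not depend on the choice of conventional significance level.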
18.9. Practical Considerations 


The issue of interaction between hazard function and unobserved heterogeneity has 
generated a huge literature. One point of view that is well documented states essen- 
tially that if the hazard function is well specified then the precise parametric specifi- 
cation of the heterogeneity distribution is relatively innocuous (Manton et al., 1986). 
This view implies that rather than parametrically modeling unobserved heterogeneity 
we can simply use robust variance estimates, given that the hazard function is well 
specified. Other studies suggest that parametric specification of the heterogeneity dis- 
tribution is not innocuous (Heckman and Singer, 1984a) and that it is desirable to 
use a nonparametric specification. Some highly influential work has advocated use 
of a discrete hazard model with a very flexible specification of the hazard function, 
combined with a parametric assumption about heterogeneity (Meyer, 1990; Han and 
Hausman, 1990). Finally, as a compromise between all the foregoing positions, some 
researchers use the Han-Hausman discrete-time approach, or a high-order polynomial
hazard function, and combine it with the Heckman-Singer approach of nonparametric
heterogeneity. However, as Baker and Melino (2000) have pointed out, this may lead to 
overparameterization that is far from innocuous. Hence it seems sensible to approach 
this issue with caution, and use parsimonious models in preference to models saturated 
with heterogeneity parameters. 

The Cox PH model has a central place in the biometrics literature. When there is no 
intrinsic interest in the baseline hazard function then this seems an attractive choice of 
functional form. It is often a good place to start modeling. However, unobserved het- 
erogeneity is important in most econometric specifications and should not be ignored. 

Many statistical packages offer a choice of standard parametric duration models that 
can be combined with any of the standard (gamma, inverse-Gaussian, or log-normal)
heterogeneity (“frailty”) specifications. Although this is a very convenient option to 
use, discrete hazard models hold greater appeal as they provide greater flexibility and 
a better match with economic data. 

The implementation of the EM algorithm for the latent class model often suffers 
from slow computational speed. Direct maximization of the likelihood is often both 
feasible and efficient. 


18.10. Bibliographic Notes 


18.2 There are many papers that discuss the specification of the heterogeneity distribution 
and consequences of misspecification. Vaupel et al. (1979) provide a good discus- 
sion of the properties of the gamma model. Hougaard (1984) considers several al- 
ternatives to the gamma. Hougaard (1995) gives a survey of heterogeneity models. 
Heckman and Singer (1984a) advocate nonparametric specification and emphasize 
the sensitivity to misspecification. Manton et al. (1986) attempt to disentangle the 
relative importance of misspecifying the hazard and heterogeneity, suggesting that 
the former is critical. 


18.3 Van den Berg (2001) provides a thorough and accessible treatment of and further 
references on the identification of the MPH model. 
Han and Hausman (1990) and Meyer (1990) offer good empirical examples that com-
bine flexible hazard specifications with parametric assumptions about heterogeneity.


The paper by Heckman and Singer (1984a) is an early discussion of the discrete 
heterogeneity model. The finite mixture model of unobserved heterogeneity is also 
commonly referred to as the “nonparametric heterogeneity” model. Baker and Melino 
(2000) describe a Monte Carlo study of duration dependence and nonparametric
heterogeneity. They consider models that combine a very flexible specification of
duration dependence with nonparametric heterogeneity. Their results suggest that,
when both are present, including many finite mixture components in the likelihood
generates large biases and unreliable results. Using the BIC or the Hannan-Quinn
criterion, both of which penalize overparameterization, can be helpful.
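As a rough illustration of how such criteria might be used to choose the number of mixture components, the following sketch computes BIC $= -2\ln L + k\ln n$ and the Hannan-Quinn criterion $= -2\ln L + 2k\ln(\ln n)$. The log-likelihood values, parameter counts, and sample size are invented for the example and do not come from any data set discussed here.

```python
import math


def bic(loglik: float, k: int, n: int) -> float:
    """Bayesian information criterion: -2 ln L + k ln n."""
    return -2.0 * loglik + k * math.log(n)


def hqic(loglik: float, k: int, n: int) -> float:
    """Hannan-Quinn criterion: -2 ln L + 2 k ln(ln n)."""
    return -2.0 * loglik + 2.0 * k * math.log(math.log(n))


# Hypothetical fits: number of mixture components -> (log-likelihood, parameters)
fits = {1: (-1520.4, 5), 2: (-1498.7, 8), 3: (-1496.9, 11)}
n_obs = 1000  # hypothetical sample size

best_bic = min(fits, key=lambda c: bic(*fits[c], n_obs))
best_hq = min(fits, key=lambda c: hqic(*fits[c], n_obs))
```

In this made-up case the third component raises the log-likelihood only slightly, so both criteria prefer the two-component model.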


Lancaster (1990) and Salant (1977) are excellent references on length-biased sam- 
pling. Lancaster provides foundational material on renewal theory that forms the ba- 
sis of several key results. Also see Taylor and Karlin (1994). 


There are many papers on specification testing in duration models, most of them 
handling the easier case of no censoring. Kiefer (1988) provides an overview. Jaggia 
(1991a) offers a brief but clear introduction to the conditional moment approach to 
specification testing (which is also summarized in Greene (2003)). As yet untried 
in the context of duration models is a very general, but computationally demanding, 
approach to specification testing due to Andrews (1997). Model selection issues for 
finite mixture models are discussed in Cameron and Trivedi (1998, chapter 6), in the 
context of count models. A good introduction to model diagnostics based on different 
types of residuals for duration models is given in Hosmer and Lemeshow (1999, 
pp. 196-240). 


Lancaster’s (1979) classic empirical paper analyzes unemployment duration in the 
context of a Weibull-gamma mixture model. Jaggia (1991c) studies misspecifica- 
tion in a strike duration model using a generalized gamma model that nests several 
popular specifications. His paper also highlights the difficulty of making inferences 
from overly restrictive models. A number of other applications of duration models 
are covered in Chapter 19. 


Exercises 


18-1 (Adapted from Sapra, 2002) The analysis of Section 18.2 shows the effects
of unobserved heterogeneity on the unconditional or averaged hazard function.
The result that neglected heterogeneity leads to underestimation of the slope of
the average hazard function is emphasized. Let the conditional hazard function
be $\lambda_c(t|\nu) = \nu\lambda_0(t)$, where $\lambda_0$ denotes the baseline or unconditional hazard
function. Show that (i) the unconditional hazard $\lambda_u(t) < \lambda_0(t)$ and (ii) $\partial\lambda_u(t)/\partial t < 0$ in
each of the following examples.

(a) $\nu \sim$ Uniform[0, 1] and $\lambda_0(t) = 1 \;\forall t$.

(b) $\nu$ follows a unit exponential distribution with pdf $g(\nu) = e^{-\nu}$ and
$\lambda_0(t) = \rho\exp(\gamma t)$, $\rho > 0$, $\gamma < 0$.
18-2 Reconsider the Weibull-gamma model of Section 18.2.3 after replacing the
gamma distributed heterogeneity assumption by the assumption that heterogeneity
is distributed according to the log-normal distribution with unit mean.

(a) Verify that in this case an analytical expression for the unconditional hazard
function is not obtainable.

(b) Substitute the integral expression for unconditional hazard into the
log-likelihood given in Section 17.6.3. Using the simulation-based maximum
likelihood approach of Section 12.4, describe an estimation algorithm that
details the various steps involved in likelihood maximization.


18-3 Consider the exponential-gamma mixture. This model is a special case of a
MPH model. The survivor function, conditional on a multiplicative heterogeneity
factor $\nu$, for the exponential model is $S(t|\nu) = \exp(-\mu t\nu)$, $\mu > 0$. The
unconditional survivor function is given by the average survivor function. Averaging
is across the heterogeneous population using $g(\nu)$, the density of $\nu$, as the
weighting function, so $S(t) = \int S(t|\nu)g(\nu)d\nu$. Assume that $\nu$ is (two-parameter)
gamma distributed with $g(\nu) = \delta^k\nu^{k-1}\exp(-\delta\nu)/\Gamma(k)$.

(a) Show that, given gamma heterogeneity, $S(t) = (1 + \mu t/\delta)^{-k}$.

(b) Derive expressions for the unconditional duration density function $f(t)$ and
the unconditional hazard function $\lambda(t)$. These general expressions can be
specialized by setting the mean of $\nu$ at 1; that is, set $k = \delta$, which leads to the
exponential-gamma mixture. Compare the mean and variance properties of
this mixture distribution with those of the original exponential distribution.

(c) Suppose that the random variable $\nu$ has a two-point distribution such that
with probability $\pi$ it takes the value $\nu_1$ and with probability $(1 - \pi)$ it takes the
value $\nu_2$. What are the implications of this assumption for the specification
of the unconditional survivor function? Explain your answer.


18-4 Using the sample of the McCall data set from the empirical exercise in the
previous chapter, reestimate the Weibull model for those transiting to full-time
employment (CENSOR1 = 1) under the assumption that unobserved heterogeneity
(also called frailty in some computer packages, which may also have a subcommand
for specifying it) has gamma distribution.

(a) Using generalized residuals as in Section 18.7.2 test the hypothesis of
model misspecification.

(b) Does the new model display a duration dependence property? Does it
provide a better fit to the data? Explain the results by reference to the interaction
between unobserved heterogeneity and duration dependence.

(c) Repeat the exercise of part (a) under the assumption of log-normal
heterogeneity. Are the results about duration dependence significantly different
from those for the gamma heterogeneity?
CHAPTER 19


Models of Multiple Hazards 


19.1. Introduction 


This chapter deals with several different duration models that can be interpreted 
broadly as multivariate models, a category that covers both parallel and repeated tran- 
sitions. Any transition model that involves more than one destination state can be re- 
garded as a multivariate model because the analysis will involve joint distribution of 
two or more durations. The models we consider arise in a variety of ways and apply 
to several different types of data. Despite their differences, they are grouped in this 
chapter for reasons of organizational convenience. 

To be concrete consider some examples. A familiar model from labor economics 
involves a transition from unemployment to employment or out of the labor force. The 
first transition can be further broken into return to the old job or to a new job. These 
destinations are mutually exclusive. An unemployment spell may end by a transition 
to any one of the destinations. A variant of this example considers an unemployed in- 
dividual who could find either a new full-time or part-time job or remain unemployed. 
Thus there are three possible states (destinations). The models of Chapters 17 and 18 
dealt with transitions between two states. One can still use the two-state methods to 
handle such data. For example, state 1 could be that of full-time employment and state 
0 could be any other state. This would, as before, involve modeling one hazard rate. 
However, one could also characterize this situation in terms of a model with three 
states and two transitions and hence two hazard functions, one specific to each desti- 
nation state. More generally, there will be a number of failure types and we may wish 
to model the transition from a given state to any one of the failure types. In this chapter 
we wish to extend the conceptual tools developed in the previous two chapters to deal 
with multiple hazards (failures) or a multivariate duration model. 

The important issues are as follows: 


1. How does one model the relation between covariates and failures of different types? 


2. How does one model interaction between failure types under a specific set of study 
conditions? 
3. How does one estimate failure rates for certain types of failures given the “removal” of
some or all other failure types? 


A multivariate duration model involves simultaneous modeling of all transitions, 
that is, joint specification and estimation of two or more hazard functions. There are 
several possible frameworks for analyzing multivariate duration data; the competing 
risks framework is one of the most popular. McCall (1996) provides an empirical 
application of the competing risks framework to unemployment data with focus on 
the role of unemployment insurance. Using an approach similar to McCall’s, Deng, 
Quigley, and Van Order (2000) study the transitions of mortgage holders to the states 
of prepayment or termination of mortgages. 

What is the motivation for and the gains from joint modeling of hazards? If the 
different hazards are essentially independent then separate and joint modeling will 
produce the same results. However, different hazards may be linked; for example, there 
may be present a common unobserved heterogeneity term in each hazard function. 
Alternatively, each hazard may include an unobserved heterogeneity term with one or 
more common shared components, leading to correlated hazards. 

A second class of examples involves a case of parallel events in which one analyzes
the joint distribution of durations to destinations. For example, the pair $(T_1, T_2)$
could be the duration of unemployment and duration without health insurance. Here 
the motivation for joint estimation of the hazards could be similar to that previously 
outlined. 

A third example involves joint distribution of lengths of repeat spells in the same 
state (e.g., unemployment, or without health insurance). That is, for a given individual, 
one wants to simultaneously model the hazards of terminating a spell. If the spells in 
question are independent, then they can be analyzed by single-spell methods of earlier 
chapters. If the researcher wants to study the dependence structure of the transitions, 
then joint modeling of spells in a given state is appropriate. New models and methods 
are called for when the spells are dependent. This last example is potentially more 
complex than the preceding ones because of possible dependence between events sep- 
arated by time intervals. For example, the length and type of a previous spell, or more 
generally the past history of spells, may affect the probability and length of a succeed- 
ing spell; or the unobserved characteristics of the individual may persist over succes- 
sive spells. Such serially correlated unobserved heterogeneity creates a link between 
repeat spells. Even the occurrence probability of an event may depend on previous 
occurrence of the same event. Heckman and Borjas (1980) characterize several struc- 
tural types of state dependence for an individual using concepts such as occurrence 
dependence and (Markovian) lagged duration dependence. 

Corresponding to these different data situations are a variety of models in the liter- 
ature. However, though they might appear to be a disparate selection they are linked 
by several common threads. After introducing the basic concepts, in Section 19.2 we 
examine the popular competing risks model. In Section 19.3 we consider a multivari- 
ate model based on marginal distributions of a set of survival times and introduce the 
copula approach to joint modeling of survival times. Multiple-spell modeling is con- 
sidered in Section 19.4. 
19.2. Competing Risks


First, we introduce some concepts that are used in the competing risks model
(CRM) and in other multivariate formulations. Often these are extensions of concepts 
already introduced in Chapter 17. The basic CRM formulation is applicable to mod- 
eling time in one state when exit is to a number of competing states, such as different 
causes of death. The CRM is attractive because it is relatively straightforward to im- 
plement if the model is a PH model. 


19.2.1. Basic Concepts 


We now consider the CRM in which there are m latent duration or failure times, one 
for each distinct competing cause of failure. 


Latent Durations 


The setup of the model is as follows. Each subject has an underlying failure time, 
which is subject to censoring. Failure time may be one of m different types, given by 
the set J = {1,..., m}. We may think of this as a situation with m distinct causes of 
transition from a given state (“death”). However, the occurrence of a failure of one 
kind of event removes the individual from risks of other kinds of events. Therefore, 
given censoring of the remaining ($m - 1$) durations for each individual, we observe at
most one complete duration. 

In a CRM with $m$ types of failures, there are $m + 1$ states $\{0, 1, \ldots, m\}$, where
0 represents the initial state and $\{1, \ldots, m\}$ are possible destination states. For the
$i$th individual the data vector is of the form $(x_i, t_i, d_{1i}, \ldots, d_{mi}, d_{ci})$, where $x_i$ is a
vector of weakly exogenous covariates that measure the characteristics of $i$, $t_i =
\min(t_{1i}, \ldots, t_{mi}, t_{ci})$, where $t_{ki}$ denotes the time to transition to the $k$th destination,
$t_{ci}$ denotes the time of censoring, and $d_{ji} = \mathbf{1}(t_{ji} = t_i)$, $j = 1, \ldots, m, c$, are dummy
variables that take the value one if $t_{ji} = t_i$. Because we only observe one of the $t_{ji}$, the
remaining are interpreted as latent variables.

Censoring may be regarded as a competing risk. It operates on individuals according
to a probability distribution. In this chapter the censoring variable is assumed to be
independent of the $(t_1, \ldots, t_m)$.

Unobserved characteristics of $i$ are subsumed under unobserved heterogeneity,
denoted as $\nu$. If $\nu$ varies with cause of exit, then we write it as $\nu_j$, $j = 1, \ldots, m$.


Competing Causes 


A standard example of competing risks is death from competing causes. Consider 
an individual who has had a kidney transplant operation and is “at risk” of transit- 
ing to the healthy state, or to rejection, or to some other unhealthy condition such 
as a liver complication. Succumbing to any one condition means that transition to 
other states is not possible. So in an m-event setup, each event provides one complete 
duration and $m - 1$ censored durations. Thus we have a situation of “competing
risks” in which there is competition to determine the transplant patient’s destination 
state. 

Although discrete-time models are often required in empirical applications, our ex- 
position of the joint hazard formulation uses the continuous-time framework and gen- 
erally follows the exposition given in Mealli and Pudney (1996). We also assume that 
we have single-spell data. 

The model provides the joint distribution of the spell duration, denoted $\tau$, and
the exit route $r$, which is an integer variable that takes one of the values in the set
$\{1, 2, \ldots, m\}$.

We ignore censoring for simplicity and assume that there exist latent variables
$(t_1, \ldots, t_m)$, one for each destination, that correspond to the spell duration for each
possible exit route by which the spell may terminate if there were no other risk factors
that might cause the spell to end sooner. Destination-specific covariates are denoted by
$x_j$ ($j = 1, \ldots, m$). We observe one duration, $\tau$, where

$$\tau = \min(t_1, \ldots, t_m) = \min_j (t_j), \quad t_j > 0, \tag{19.1}$$

at the termination of the spell; that is, only the shortest duration is observed and the
rest are censored. Censoring owing to factors other than exit is not considered. Then

$$\Pr[\tau > t] = \Pr[t_1 > t, \ldots, t_m > t] = S(t), \tag{19.2}$$

which is the joint survivor function. If the risks are independent then

$$\Pr[\tau > t] = \Pr[t_1 > t] \times \Pr[t_2 > t] \times \cdots \times \Pr[t_m > t]. \tag{19.3}$$

The corresponding exit route $r$ is given by

$$r = \arg\min_j (t_j). \tag{19.4}$$

Let $g_j(t)dt$ denote the probability of succumbing to risk $j$ in the interval $(t, t + dt)$;
then the total hazard rate applicable to all causes is

$$\lambda(t) = -\frac{d}{dt}\ln S(t) = \sum_{j=1}^{m} g_j(t).$$


In biostatistics this is referred to as the total force of mortality (David and
Moeschberger, 1978). If risks are independent, then the hazard rate for a specific cause
$j$ is $\lambda_j(t) = g_j(t)$. This means that the probability of failure from cause $j$ in $(t, t + dt)$,
conditional on survival to $t$, is the same whether $j$ is one of the risks or the only risk.
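The latent-duration setup in (19.1) and (19.4) can be illustrated by simulation. The sketch below assumes independent risks with constant (exponential) cause-specific hazards, an assumption made here only for illustration; the function name and the rates are hypothetical. Under these assumptions the observed duration is exponential with rate equal to the sum of the cause-specific rates, and route $j$ is taken with probability proportional to its rate.

```python
import random


def simulate_crm(rates, n, seed=0):
    """Draw n spells from an independent competing risks model.

    Each latent duration t_j is exponential with hazard rates[j-1];
    the observed duration is tau = min_j t_j and the exit route is
    r = argmin_j t_j (1-based), as in (19.1) and (19.4).
    """
    rng = random.Random(seed)
    durations, routes = [], []
    for _ in range(n):
        latent = [rng.expovariate(lam) for lam in rates]
        tau = min(latent)
        durations.append(tau)
        routes.append(latent.index(tau) + 1)
    return durations, routes


rates = [0.5, 1.0, 1.5]          # hypothetical cause-specific hazards
durations, routes = simulate_crm(rates, 50_000)

total_hazard = sum(rates)        # total hazard = sum of cause-specific hazards
mean_tau = sum(durations) / len(durations)            # close to 1 / total_hazard
shares = [routes.count(j + 1) / len(routes) for j in range(len(rates))]
```

The simulated mean duration and exit shares can be compared with $1/\sum_j \lambda_j$ and $\lambda_j/\sum_k \lambda_k$, respectively.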
The probability of surviving risk $j$ over the interval $(T_1, T_2)$, conditional on
surviving to $T_1$, follows from the integrated hazard over that interval. Writing
$S_j(T) = \Pr[t_j > T]$, we have

$$\begin{aligned}
\int_{T_1}^{T_2}\lambda_j(t)dt &= \int_0^{T_2}\lambda_j(t)dt - \int_0^{T_1}\lambda_j(t)dt \\
&= \ln S_j(T_1) - \ln S_j(T_2) \\
&= -\ln\frac{\Pr[t_j > T_2]}{\Pr[t_j > T_1]},
\end{aligned} \tag{19.5}$$

or equivalently

$$\exp\left(-\int_{T_1}^{T_2}\lambda_j(t)dt\right) = \frac{\Pr[t_j > T_2]}{\Pr[t_j > T_1]}. \tag{19.6}$$

One minus the left-hand side expression is referred to as the net probability of death
from cause $j$ in the interval $(T_1, T_2)$. The expression in (19.6) is useful for building up
the likelihood function for estimation.


Independent Risks 


We can now explicitly bring into the picture covariates that affect the hazard rate. We
assume independent risks (as opposed to correlated risks) and consider the distribution
of $t_j$. The hazard rate for failure of $j$th type is defined by

$$\lambda_j(t_j|x_j) = \lim_{\Delta t_j \to 0}\frac{\Pr[t_j \le T < t_j + \Delta t_j \,|\, T \ge t_j, x_j]}{\Delta t_j},$$

and the integrated hazard $\Lambda_j(t_j|x_j)$ for the $j$th type risk is defined by

$$\Lambda_j(t_j|x_j) = \int_0^{t_j}\lambda_j(s|x_j)ds.$$

Then the duration density is

$$\begin{aligned}
f_j(t|x_j, \beta_j) &= \lambda_j(t|x_j, \beta_j)S_j(t|x_j, \beta_j) \\
&= \lambda_j(t|x_j, \beta_j)\exp[-\Lambda_j(t|x_j, \beta_j)],
\end{aligned}$$

using the relation between survivor and integrated hazard functions. Defining $x =
[x_1, \ldots, x_m]'$ and $\beta = [\beta_1, \ldots, \beta_m]'$ gives the joint density of $\tau$ and $r$:

$$\begin{aligned}
f(\tau, r|x, \beta) &= f_r(\tau|x_r, \beta_r)\prod_{j \ne r}\exp[-\Lambda_j(\tau|x_j, \beta_j)] \\
&= \lambda_r(\tau|x_r, \beta_r)\exp[-\Lambda_r(\tau|x_r, \beta_r)] \times \prod_{j \ne r}\exp[-\Lambda_j(\tau|x_j, \beta_j)] \\
&= \lambda_r(\tau|x_r, \beta_r)\prod_{j=1}^{m}\exp[-\Lambda_j(\tau|x_j, \beta_j)].
\end{aligned} \tag{19.7}$$

The first line follows from the product of conditional and marginal probabilities. The
second term on the right-hand side is the product of survival probabilities for all exit
routes other than $r$, which uses the independence of risks assumption.
Equation (19.7) implies that

$$\begin{aligned}
f(\tau, r|x, \beta) &= \lambda_r(\tau|x_r, \beta_r)\exp\left[-\sum_{j=1}^{m}\Lambda_j(\tau|x_j, \beta_j)\right] \\
&= \lambda_r(\tau|x_r, \beta_r)\exp[-\Lambda^*(\tau|x, \beta)],
\end{aligned} \tag{19.8}$$

where $\Lambda^*(\tau|x, \beta) = \sum_{j=1}^{m}\Lambda_j(\tau|x_j, \beta_j)$ is the aggregate or overall integrated hazard.
This last equation simply says that the total hazard of leaving the origin state is the
sum of hazards for all destinations. The overall survivor function is

$$S(\tau) = \exp(-\Lambda^*(\tau)). \tag{19.9}$$


The likelihood function given independent risks is the product over all observations 
of terms like (19.7). This likelihood can be written explicitly if all relevant functional 
forms are specified. Many issues that were previously relevant, such as flexibility of 
functional form, unobserved heterogeneity, and so forth, remain relevant in the context 
of CRM. Instead of keeping the discussion at a general level, we now consider specific 
functional forms. The proportional hazard specification is popular in the literature and 
will be used here. 
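To make the structure of the independent-risks likelihood concrete, the following sketch evaluates the log of (19.7) for uncensored spells. It assumes, purely for illustration, Weibull cause-specific hazards $\lambda_j(t|x) = \alpha_j t^{\alpha_j - 1}\exp(x\beta_j)$ with a single scalar covariate; the function names, data, and parameter values are hypothetical. It also shows that the joint log-likelihood equals the sum of single-risk log-likelihoods in which exits by other routes are treated as censored.

```python
import math


def weibull_hazard(t, x, alpha, beta):
    """Cause-specific hazard alpha * t**(alpha-1) * exp(x*beta) (assumed form)."""
    return alpha * t ** (alpha - 1.0) * math.exp(x * beta)


def weibull_cum_hazard(t, x, alpha, beta):
    """Integrated hazard t**alpha * exp(x*beta) matching weibull_hazard."""
    return t ** alpha * math.exp(x * beta)


def crm_loglik(data, params):
    """Joint log-likelihood: sum of ln lambda_r(tau) - sum_j Lambda_j(tau).

    data   : list of (tau, r, x) with 1-based exit route r
    params : list of (alpha_j, beta_j), one pair per risk
    """
    ll = 0.0
    for tau, r, x in data:
        a_r, b_r = params[r - 1]
        ll += math.log(weibull_hazard(tau, x, a_r, b_r))
        ll -= sum(weibull_cum_hazard(tau, x, a, b) for a, b in params)
    return ll


def cause_specific_loglik(data, j, alpha, beta):
    """Single-risk log-likelihood for cause j; other exits count as censored."""
    ll = 0.0
    for tau, r, x in data:
        if r == j:
            ll += math.log(weibull_hazard(tau, x, alpha, beta))
        ll -= weibull_cum_hazard(tau, x, alpha, beta)
    return ll


# Hypothetical spells (tau, r, x) and parameter values
data = [(0.8, 1, 0.0), (1.5, 2, 1.0), (0.3, 1, 1.0), (2.1, 2, 0.0)]
params = [(1.2, 0.4), (0.9, -0.3)]

joint = crm_loglik(data, params)
separate = sum(cause_specific_loglik(data, j + 1, a, b)
               for j, (a, b) in enumerate(params))
```

Because `joint` and `separate` coincide, each risk's parameters can be estimated from its own single-risk likelihood when risks are independent and parameters are not shared across risks.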


19.2.2. CRM with Proportional Hazards 


The goal here is to derive the joint density of spell length and reason for exit, and this 
can be done by aggregating the integrated hazard over reasons for exit. 
Consider PH models of the form 


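$$\lambda_j(t|x) = \lambda_{0j}(t)\exp[x'(t)\beta_j], \quad j = 1, \ldots, m,$$

where both the baseline hazard $\lambda_{0j}$ and $\beta_j$ are specific to type $j$ hazard, and
$t_{j1} < \cdots < t_{jk_j}$ denote the $k_j$ ordered failures of type $j$. For example, if $m = 2$, then $k_1$
refers to the number of individuals who registered a failure of type 1, and $k_2$ to the
number of individuals who registered a failure of type 2.
The likelihood function for the Cox CRM is then

$$L(\beta_1, \ldots, \beta_m) = \prod_{j=1}^{m}\prod_{i=1}^{k_j}\frac{\exp[x'_{ji}(t_{ji})\beta_j]}{\sum_{l \in R(t_{ji})}\exp[x'_l(t_{ji})\beta_j]} = \prod_{j=1}^{m}L_j(\beta_j), \tag{19.10}$$

where

$$L_j(\beta_j) = \prod_{i=1}^{k_j}\frac{\exp[x'_{ji}(t_{ji})\beta_j]}{\sum_{l \in R(t_{ji})}\exp[x'_l(t_{ji})\beta_j]}, \tag{19.11}$$

and $R(t_{ji})$ denotes the set of individuals at risk at time $t_{ji}$.
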


Notice the following four features of this likelihood: (1) $L_j(\beta_j)$ is the partial
likelihood developed in Section 17.8.2. The baseline hazard function is absent, and the
asymptotic distribution results stated previously also apply. (2) $L(\beta_1, \ldots, \beta_m)$ can be
jointly maximized by maximizing each individual factor $L_j(\beta_j)$, given the
independence of risks; hence joint and separate maximizations are equivalent. Estimation and
comparison of the $\beta_j$'s can be made by applying standard asymptotic techniques to
each individual factor in the $m$-term likelihood. (3) The ideas of Sections 17.7 and 17.8
can be extended directly. If a discrete-time (dummy variable) formulation is used for
each type of hazard, then the identifiable components of the hazard function can be
estimated for each type of hazard jointly with the $\beta_j$. (4) Unobserved heterogeneity can
be introduced exactly as in the single-spell, two-state proportional hazards model in
Chapter 18.
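A minimal sketch of one factor of the Cox CRM likelihood follows. It assumes a single time-invariant scalar covariate and no tied failure times, and it treats exits by any other route (and censored spells, coded 0) as censored for the cause being analyzed; the function name and the data are hypothetical.

```python
import math


def cox_partial_loglik(beta, times, routes, x, cause):
    """Log partial likelihood for one failure type.

    times  : observed durations
    routes : exit route of each spell (0 = censored)
    x      : scalar covariate for each spell
    cause  : failure type j whose factor L_j(beta_j) is evaluated
    """
    ll = 0.0
    for i, (t_i, r_i) in enumerate(zip(times, routes)):
        if r_i != cause:
            continue  # other exits contribute only through the risk sets
        # risk set R(t_i): spells still in progress just before t_i
        risk_sum = sum(math.exp(x[l] * beta)
                       for l, t_l in enumerate(times) if t_l >= t_i)
        ll += x[i] * beta - math.log(risk_sum)
    return ll


# Hypothetical data: five spells, two failure types, one censored spell
times = [2.0, 3.5, 1.0, 4.2, 0.7]
routes = [1, 2, 1, 0, 2]
x = [0.5, -1.0, 1.2, 0.0, 0.3]

ll_cause1 = cox_partial_loglik(0.4, times, routes, x, cause=1)
ll_cause2 = cox_partial_loglik(-0.2, times, routes, x, cause=2)
joint_ll = ll_cause1 + ll_cause2   # log of the product of the two factors
```

Since each factor depends only on its own coefficient, maximizing `joint_ll` amounts to maximizing the two factors separately, for example by passing each function to a one-dimensional optimizer.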


19.2.3. Identification of CRM 


Cox (1962a) and Tsiatis (1975) showed that when the CRM has no covariates, the
model is not identified. More precisely, this means that any CRM with dependent risks 
is observationally equivalent to a CRM with independent risks. However, Heckman 
and Honoré (1989) showed that under certain assumptions a CRM that has the mixed 
PH form with covariates is identified. Van den Berg (2001, pp. 3438-3441) provides an 
exposition of the underlying assumptions. Assumptions additional to those discussed 
in Chapter 17 are needed. For example, the covariates must show “sufficient variation” 
and should not be perfectly collinear. We also require that the baseline hazards for 
different risks should not be perfectly related. 


19.2.4. Interpretation of Regression Coefficients 


In the proportional hazards type formulation of CRM, the impact of a change in a
covariate on the hazard rate for transition from a given state is analogous to that in
the PH model of Chapter 17, but the direct interpretation of regression coefficients
faces a problem similar to that discussed for the multinomial logit in Section
15.4.3.

However, one may also be interested in the impact of change in a covariate on the
probability of exit via route $r$. This is harder to calculate. To see this note that the
expression for the probability of exiting a given state via route $r$ is given by

$$\Pr[r|\tau, x, \beta] = \frac{\lambda_r(\tau|x_r, \beta_r)}{\sum_{j=1}^{m}\lambda_j(\tau|x_j, \beta_j)}. \tag{19.12}$$

Let $x_k$ denote the $k$th covariate, assumed to enter every hazard, with coefficient
$\beta_{jk}$ in hazard $j$. Because covariates appear in both the numerator and the denominator,
and moreover the denominator is the sum of all hazards, the sign of the partial derivative
$\partial\Pr[r|\tau, x, \beta]/\partial x_k$ depends on all the parameters in the model. It is then not true
that the sign of $\beta_{rk}$ is also the sign of the partial. (The situation is exactly analogous
to that discussed in Chapter 15 on multinomial models.) However, the following result
is available if the competing risk is of the proportional hazard type (Thomas, 1996,
p. 31). If $\beta_{rk} > \beta_{jk}$, $\forall j \ne r$, then the sign of $\partial\Pr[r|\tau, x, \beta]/\partial x_k$ is positive. In
words, an increase in $x_k$ will increase the conditional probability of exit via route $r$
if its estimated coefficient in $\lambda_r(\cdot)$ is larger than the corresponding coefficients in all
other hazard functions.
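The sign result for (19.12) can be checked numerically. The sketch below assumes proportional hazards with constant baselines and a single covariate common to all hazards, so that hazard $j$ is $\lambda_{0j}\exp(x\beta_j)$; the baseline values, coefficients, and function names are hypothetical. In this setting the derivative of the exit probability is $\Pr[r]\,(\beta_r - \sum_j \Pr[j]\beta_j)$, so a positive coefficient does not guarantee a positive effect.

```python
import math


def exit_prob(r, x, base, beta):
    """Pr[exit via route r] = lambda_r / sum_j lambda_j, as in (19.12)."""
    haz = [b0 * math.exp(x * b) for b0, b in zip(base, beta)]
    return haz[r - 1] / sum(haz)


def d_exit_prob(r, x, base, beta, h=1e-6):
    """Central-difference derivative of the exit probability w.r.t. x."""
    return (exit_prob(r, x + h, base, beta)
            - exit_prob(r, x - h, base, beta)) / (2.0 * h)


base = [3.0, 1.0, 1.0]     # hypothetical baseline hazards
beta = [0.8, 0.5, 0.2]     # hypothetical coefficients on the common covariate

effect_route1 = d_exit_prob(1, 0.0, base, beta)  # largest coefficient: positive
effect_route2 = d_exit_prob(2, 0.0, base, beta)  # positive coefficient, negative effect
```

Route 2 has a positive coefficient, yet its exit probability falls as the covariate rises, because route 1's hazard grows faster.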
19.2.5. CRM with Unobserved Heterogeneity


If the competing risks are of the proportional hazards type, then the methods of the
previous chapter can be extended to include unobserved heterogeneity. A general
specification of unobserved heterogeneity allows for a state-specific random component. Let
$\nu = (\nu_1 \ldots \nu_m)$ be the vector of unobserved multiplicative heterogeneity terms that are
assumed to have a joint distribution function $G(\nu)$; then,

$$\begin{aligned}
f(\tau, r|x, \beta, \nu) &= \lambda_r(\tau|x_r, \beta_r, \nu_r)\exp\left[-\sum_{j=1}^{m}\Lambda_j(\tau|x_j, \beta_j, \nu_j)\right] \\
&= \lambda_r(\tau|x_r, \beta_r)\nu_r\exp\left[-\sum_{j=1}^{m}\Lambda_j(\tau|x_j, \beta_j)\nu_j\right],
\end{aligned}$$

where the second line follows from assumption of multiplicative heterogeneity.
This is an example of a competing risks model with state-specific random effects.
The distribution marginal with respect to $\nu$ is obtained by integrating out $\nu$,

$$f(\tau, r|x, \beta) = \int\cdots\int\lambda_r(\tau|x_r, \beta_r)\nu_r\exp\left[-\sum_{j=1}^{m}\Lambda_j(\tau|x_j, \beta_j)\nu_j\right]dG(\nu),$$

which involves an $m$-fold integral.

A manageable case is one in which the m elements of v are independent gamma 
distributed random variables. In this case the m-fold integral decomposes into a prod- 
uct of m integrals. An example is the case in which we have a Weibull-gamma 
mixture for each cause-specific hazard function. In this case the competing risks are 
independent. 

If we allow the elements of v to be correlated, then we get a more interesting case in 
which the competing risks are dependent. Indeed, this is a very widely used “trick” for 
generating dependence among competing risks. Specifically, suppose we have a
multivariate log-normal distribution for $\nu$, that is, $[\ln\nu_1 \ldots \ln\nu_m]' \sim N[0, \Sigma]$. This has
two consequences. First, it induces dependence in the competing risks through hetero- 
geneity; second, it makes computation of maximum likelihood estimates considerably 
more difficult. The reason for the latter is that the m-fold integral does not have an 
analytic expression. Consequently, Monte Carlo integration will have to be used. If 
m equals two or three as in many applied examples, this is still manageable but far 
from trivial. To reduce the dimensionality of the integral it may be useful to restrict 
the structure of the covariance matrix. For example, we may use a factor structure in 
which each term $\nu_j$ may be specified to be a linear function of (say) two iid random
variables, with unknown weights (factor loadings). For identifiability, normalization 
restrictions on the weight coefficients may be necessary. 
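The Monte Carlo integration step can be sketched for $m = 2$ as follows. The example assumes constant baseline hazards, so that hazard $j$ equals $\mu_j\nu_j$ with $\mu_j$ standing for $\exp(x'\beta_j)$, and draws $(\ln\nu_1, \ln\nu_2)$ from $N[0, \Sigma]$ using a hand-coded Cholesky factor. All function names and numerical values are hypothetical, and this is only one of several ways such a simulator might be written.

```python
import math
import random


def cholesky_2x2(s11, s12, s22):
    """Lower-triangular factor of a 2x2 covariance matrix."""
    l11 = math.sqrt(s11)
    l21 = s12 / l11 if l11 > 0 else 0.0
    l22 = math.sqrt(max(s22 - l21 * l21, 0.0))
    return l11, l21, l22


def simulated_density(tau, r, mu, sigma, n_draws=2000, seed=0):
    """Monte Carlo estimate of f(tau, r) with log-normal heterogeneity.

    mu    : (mu_1, mu_2), hazards without heterogeneity (assumed constant in t)
    sigma : (s11, s12, s22), covariance of (ln nu_1, ln nu_2)
    The fixed seed reuses the same draws across parameter values.
    """
    rng = random.Random(seed)
    l11, l21, l22 = cholesky_2x2(*sigma)
    total = 0.0
    for _ in range(n_draws):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        nu = (math.exp(l11 * z1), math.exp(l21 * z1 + l22 * z2))
        # conditional density: lambda_r * nu_r * exp(-sum_j Lambda_j * nu_j)
        total += mu[r - 1] * nu[r - 1] * math.exp(
            -tau * (mu[0] * nu[0] + mu[1] * nu[1]))
    return total / n_draws


mu = (0.5, 0.8)  # hypothetical hazards
f_no_het = simulated_density(1.0, 1, mu, (0.0, 0.0, 0.0), n_draws=100)
f_het = simulated_density(1.0, 1, mu, (0.5, 0.2, 0.5))
```

With a zero covariance matrix the simulator collapses to the density without heterogeneity, which gives a simple correctness check. The simulated log-likelihood is the sum of the logs of such averages over observations.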


19.2.6. CRM with Dependent Competing Risks 


The independent CRM has an important computational advantage over the model in 
which dependence is induced through heterogeneity variables correlated across com- 
peting hazards. However, the latter yields valuable additional information about the 
structure of heterogeneity, such as the association parameter(s). Nonetheless, there
remains the practical issue of how restrictive a specification of correlated heterogeneity
one should choose. For exposition let us view the problem in a bivariate regression-like
setting using the following setup similar to that in (17.20):

$$\ln\left[\int_0^{t_1}\lambda_{01}(u)du\right] = -x'\beta_1 - \nu_1 + \varepsilon_1,$$

$$\ln\left[\int_0^{t_2}\lambda_{02}(u)du\right] = -x'\beta_2 - \nu_2 + \varepsilon_2.$$


Now we could assume $\nu_1 = \nu_2 = \nu$, that is, exactly the same unobserved heterogeneity
term in both hazard models. The assumption is that the same unobserved factors
affect both spells but their impact may differ. This amounts to perfectly correlated
heterogeneity across the two hazards. Less restrictively, we could assume that, for
example, $\nu_1$ and $\nu_2$ are correlated and estimate an association parameter. We can think of
these as one- and two-factor models of heterogeneity, respectively. Whether the more
restrictive approach is empirically desirable depends on the context. For example, if
the two hazards pertain to the same individual, and we think of $\nu_1$ and $\nu_2$ as reflecting
individual-specific factors, then the one-factor model has justification. If, however, we
think of the two factors as hazard-specific, then the two-factor model is more appealing.
There is some theoretical and Monte Carlo evidence that the use of the one-factor
model when the two-factor model is the correct specification causes significant
distortions (Lindeboom and Van den Berg, 1994).


19.3. Joint Duration Distributions 


In this section we consider the case of nonmutually exclusive or parallel spells that are 
dependent. Survival times are assumed to be continuous. The exposition is at a general 
level and for simplicity it is restricted to the case where the spells are not censored and 
have parametric distributions. 

In applied work on jointly distributed survival times a natural starting point would 
be a particular functional form for the joint survival or the joint density function that 
may be used. Are there some “standard” functional forms available? Or is there a 
general method for generating the multivariate counterparts of the models considered 
in the previous chapters? We consider these issues in the following. 


19.3.1. Extending Survival Concepts to a Multivariate Setting 


It is helpful to begin by extending the definitions and concepts of the two previous 
chapters to the multivariate case. 
A multivariate survival function $S(\mathbf{t})$ is defined by

$$
S(\mathbf{t}) = S(t_1, \ldots, t_q) = \Pr[T_1 > t_1, \ldots, T_q > t_q], \tag{19.13}
$$

where $T_1, \ldots, T_q$ are $q$ survival times with univariate survival functions $S_j(t_j)$. By
definition,

$$
S_j(t_j) = S(T_1 > 0, \ldots, T_j > t_j, \ldots, T_q > 0) = S(0, \ldots, t_j, \ldots, 0).
$$

Unlike the case of the univariate survival function,

$$
S(t_1, \ldots, t_q) \neq 1 - F(t_1, \ldots, t_q).
$$

For example, $S(t_1, t_2) = 1 - F_1(t_1) - F_2(t_2) + F(t_1, t_2)$.
The joint density of $(t_1, \ldots, t_q)$ is denoted by $f(t_1, \ldots, t_q)$; if $F(t_1, \ldots, t_q)$ is
continuous then

$$
f(t_1, \ldots, t_q) = \frac{\partial^q F(t_1, \ldots, t_q)}{\partial t_1 \cdots \partial t_q}
= (-1)^q \frac{\partial^q S(t_1, \ldots, t_q)}{\partial t_1 \cdots \partial t_q}. \tag{19.15}
$$

Analogous to the univariate case the joint hazard function is $\lambda(t_1, \ldots, t_q)$ and is
defined by

$$
\lambda(t_1, \ldots, t_q) = \frac{f(t_1, \ldots, t_q)}{S(t_1, \ldots, t_q)}. \tag{19.16}
$$

The joint integrated hazard $\Lambda(t_1, \ldots, t_q)$ is the $q$-fold integral of $\lambda(t_1, \ldots, t_q)$.
However, there is no simple relationship between $\Lambda(t_1, \ldots, t_q)$ and $S(t_1, \ldots, t_q)$
analogous to the univariate case.

Given these definitions, is it possible to derive joint survival functions? Clayton and
Cuzick (1985) consider a bivariate model that illustrates the definitions given here. The
starting point in their analysis is an assumption about the “cross-hazard ratio” function,
defined as a function of two conditional hazard functions of $t_1$, one given $T_2 = t_2$ and
the other given $T_2 > t_2$. This leads to a nonlinear, second-order partial differential
equation whose solution generates a joint survival function in which the cross-hazard
ratio function plays an important role. We refer to the original sources for detail but note
that this approach requires assumptions that may be difficult to extend to dimensions
higher than two.


19.3.2. Bivariate Distributions Based on Marginals 


This section briefly reviews some approaches for generating bivariate duration models. 
The approach builds on assumptions about marginal survival functions. This may be 
useful if the researcher has a good feel for such marginal distributions and wants to use 
them as building blocks. Of course, choice of the building blocks places restrictions 
on the form of the resulting joint distribution. 

One approach, which is due to Marshall and Olkin (1990), considers a model with
multiplicative unobserved heterogeneity in the marginal distributions of both failure
times in the following way. Let $f_i(t_i \mid \mathbf{x}_i, v)$, $i = 1, 2$, denote the marginal distributions
of $t_1$, $t_2$, given covariates $\mathbf{x}_1$, $\mathbf{x}_2$; here $v$ is the common unobserved heterogeneity term
in the two marginals and is the source of association between the two hazards. In
survival analysis such a model might be referred to as a “shared frailty” model; $v$ is the
(only) source of correlation between $t_1$ and $t_2$. Assume that $v$, $v > 0$, has a probability
distribution with density $g(v)$. The bivariate distribution of $t_1$, $t_2$ is formally defined as

$$
f(t_1, t_2 \mid \mathbf{x}_1, \mathbf{x}_2) = \int_0^\infty f_1(t_1 \mid \mathbf{x}_1, v)\, f_2(t_2 \mid \mathbf{x}_2, v)\, g(v)\, dv, \tag{19.17}
$$

where distribution parameters are suppressed for notational simplicity.

This bivariate distribution generated as a mixture may or may not have a closed-form
solution, so without a specific parametric specification one cannot say whether
the result will be computationally convenient to use. It is also the case that the resulting
bivariate distribution will restrict the correlation between $t_1$ and $t_2$ to be positive. In
some cases this may not be desirable.

This general approach, applicable to any type of data, can be specialized to the 
present case by replacing the marginal distributions with marginal survivor functions 
and deriving the joint survivor function by integrating out the variable v; thus, 


$$
S(t_1, t_2 \mid \mathbf{x}_1, \mathbf{x}_2) = \int_0^\infty S_1(t_1 \mid \mathbf{x}_1, v)\, S_2(t_2 \mid \mathbf{x}_2, v)\, g(v)\, dv. \tag{19.18}
$$


An example of the application of this idea is provided by Clayton and Cuzick (1985), 
who use such a formulation to obtain a bivariate survivor function under the assump- 
tion of marginal proportional hazards with gamma heterogeneity. 
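
As a rough numerical check of the mixing integral in (19.18), the following sketch assumes exponential conditional survivor functions, $S_j(t_j \mid v) = \exp(-v \lambda_j t_j)$, and a gamma density for $v$ with mean one and variance $\theta$. These functional forms, the function names, and the parameter values are illustrative assumptions rather than anything specified in the text. Under them the integral has the closed form $(1 + \theta(\lambda_1 t_1 + \lambda_2 t_2))^{-1/\theta}$, which a simple midpoint-rule quadrature should reproduce.

```python
import math


def gamma_density(v, theta):
    """Gamma density with mean 1 and variance theta (shape 1/theta, scale theta)."""
    shape = 1.0 / theta
    return v ** (shape - 1.0) * math.exp(-v / theta) / (math.gamma(shape) * theta ** shape)


def joint_survivor_numeric(t1, t2, lam1, lam2, theta, upper=40.0, n=100_000):
    """Approximate the integral in (19.18) by the midpoint rule on [0, upper]."""
    # S1(t1|v) * S2(t2|v) = exp(-v * cum_hazard) under the exponential assumption.
    cum_hazard = lam1 * t1 + lam2 * t2
    step = upper / n
    total = 0.0
    for i in range(n):
        v = (i + 0.5) * step
        total += math.exp(-v * cum_hazard) * gamma_density(v, theta)
    return total * step


def joint_survivor_closed_form(t1, t2, lam1, lam2, theta):
    """Closed form implied by gamma mixing of exponential survivor functions."""
    return (1.0 + theta * (lam1 * t1 + lam2 * t2)) ** (-1.0 / theta)


# Hypothetical durations, hazard rates, and heterogeneity variance.
numeric = joint_survivor_numeric(1.0, 2.0, 0.5, 0.3, theta=0.5)
closed = joint_survivor_closed_form(1.0, 2.0, 0.5, 0.3, theta=0.5)
print(numeric, closed)
```

When the mixing distribution does not deliver a closed form, the same quadrature (or a simulation-based average over draws of $v$) can stand in for the integral, at a higher computational cost.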

As illustrated, this approach for generating a bivariate survivor model is somewhat
restrictive. One source of restriction is the assumption of one-factor unobserved
heterogeneity. In principle this restriction is easy to remove. For example, we could replace $v$
by $(v_1, v_2)$, $v_1 > 0$, $v_2 > 0$, which represents a vector of two correlated elements, one
specific to each survivor function, with a joint probability distribution $g(v_1, v_2)$. Then

$$
S(t_1, t_2 \mid \mathbf{x}_1, \mathbf{x}_2) = \int_0^\infty \!\! \int_0^\infty S_1(t_1 \mid \mathbf{x}_1, v_1)\, S_2(t_2 \mid \mathbf{x}_2, v_2)\, g(v_1, v_2)\, dv_1\, dv_2. \tag{19.19}
$$
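For concreteness suppose that

$$
\begin{aligned}
v_1 &= \omega_{11}\varepsilon_1 + \omega_{12}\varepsilon_2, \\
v_2 &= \omega_{21}\varepsilon_1 + \omega_{22}\varepsilon_2, \\
\varepsilon_j &\sim G[1, \sigma_j^2], \quad j = 1, 2,
\end{aligned}
$$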


where $\{\omega_{ij};\ i, j = 1, 2\}$ are unknown parameters, frequently referred to as “factor
loadings.” This says that the heterogeneity components $(v_1, v_2)$ are correlated linear
combinations of iid random components $\varepsilon_1$ and $\varepsilon_2$ if the factor loadings are not zero.
Other popular assumptions in empirical work are (i) that $(\ln \varepsilon_1, \ln \varepsilon_2)$ have a standard
bivariate normal distribution or (ii) that $v_1$, $v_2$ have a discrete (finite-mixture)
distribution. So the model (19.19) has a bivariate mixture form. Additional identifying
restrictions (e.g., the normalization $\omega_{jj} = 1$) are also necessary. The Pearson
correlation coefficient between $v_1$ and $v_2$, $\mathrm{Cov}[v_1, v_2] / (\mathrm{V}[v_1]\mathrm{V}[v_2])^{1/2}$, depends on
$\{\omega_{ij}, \sigma_j^2;\ i, j = 1, 2\}$, and it is straightforward to verify that here this quantity would
not have the usual $-1$ and $+1$ as the lower and upper bounds. (Also note that the
corresponding association parameter for failure times is $\mathrm{Cov}[t_1, t_2] / (\mathrm{V}[t_1]\mathrm{V}[t_2])^{1/2}$,
which is distinct from that given.) Van den Berg (1997) derives sharp bounds on
$\mathrm{Cor}[t_1, t_2 \mid \mathbf{x}]$, specifically $-1/3 < \mathrm{Cor}[t_1, t_2 \mid \mathbf{x}] < 1/2$, for a mixed proportional
hazard model with constant baseline hazard, and shows that these bounds do not depend
on the covariates $\mathbf{x}$ nor on the distribution of heterogeneity. If the baseline hazard is not
constant, the correlation bounds also depend on it.
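
To make the dependence of the heterogeneity correlation on the factor loadings concrete, the sketch below computes $\mathrm{Cor}[v_1, v_2]$ for a two-factor specification. It assumes that $\varepsilon_1$ and $\varepsilon_2$ are independent with given variances; the function name and all numerical values are hypothetical. With nonnegative loadings, which is a natural (but here assumed) way to keep $v_1$ and $v_2$ positive, the covariance cannot be negative, which illustrates why the usual $-1$ lower bound is not attained.

```python
import math


def heterogeneity_correlation(w11, w12, w21, w22, var_e1, var_e2):
    """Pearson correlation of v1 = w11*e1 + w12*e2 and v2 = w21*e1 + w22*e2,
    assuming e1 and e2 are independent with variances var_e1 and var_e2."""
    cov = w11 * w21 * var_e1 + w12 * w22 * var_e2
    var_v1 = w11 ** 2 * var_e1 + w12 ** 2 * var_e2
    var_v2 = w21 ** 2 * var_e1 + w22 ** 2 * var_e2
    return cov / math.sqrt(var_v1 * var_v2)


# Hypothetical loadings with the diagonal loadings normalized to one.
two_factor = heterogeneity_correlation(1.0, 0.5, 0.3, 1.0, 1.0, 1.0)
# One-factor special case: both components load only on e1.
one_factor = heterogeneity_correlation(1.0, 0.0, 0.7, 0.0, 1.0, 1.0)
# No cross-loadings: the two components are uncorrelated.
no_cross = heterogeneity_correlation(1.0, 0.0, 0.0, 1.0, 1.0, 1.0)
print(two_factor, one_factor, no_cross)
```

The one-factor case returns a correlation of one, matching the perfectly correlated heterogeneity described earlier, while removing the cross-loadings drives the correlation to zero.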

The factor loading specification has computational advantages relative to that in 
which the unobserved heterogeneity components enter in an unrestricted manner. Al- 
though a one-factor model is likely to be too restrictive, an unrestricted model gives 
rise to a potentially high dimensional integral. From a computational viewpoint, the re- 
sulting distribution may or may not be easy to handle, depending in part on whether or 
not the integration produces a closed-form expression for the joint survivor function. 
If it does not, a simulation-based approach will be needed for estimation. At present 
estimation of such a model would require going beyond standard packages. 

The factor loading specification does place restrictions on the model (Van den Berg,
2001; Lindeboom and Van den Berg, 1994). For example, if one of the marginal models
does not indicate the presence of unobserved heterogeneity, then $\mathrm{Cov}[v_1, v_2]$ must
be zero; if $\mathrm{V}[v_1] > 0$ and $\mathrm{V}[v_2] > 0$, then $\mathrm{Cov}[v_1, v_2] \neq 0$. Hence if $\mathrm{Cov}[v_1, v_2] = 0$,
then one of the marginals has no unobserved heterogeneity.

From an applied perspective an attractive multivariate survivor function should be
flexible. The approach just outlined has some limitations, and alternative approaches
have been proposed. One such approach that holds some promise is the
use of copula functions. Hougaard (2000, pp. 435-437) provides an introduction in the
context of survival analysis.


19.3.3. The Copula Approach 


Copulas, originally introduced by Sklar in a 1959 article in French (see also Sklar, 
1973), have been suggested as a useful method for deriving joint distributions given the 
marginals, especially when one wants to work with nonnormal distributions. Although 
we introduce this idea in the context of joint survival models, where it has found ready 
applications, it can also be used to study the joint distributions of any set of discrete, 
continuous, or mixed discrete/continuous variables. 

The approaches already discussed (e.g., the Marshall-Olkin method) generate 
dependence between variables through unobserved heterogeneity components. This 
seems attractive in most applications because it is impossible for observed covariates 
to cover all relevant aspects of an economic event. 


Properties of Copulas 


To define a copula we begin with possibly dependent uniform random variables
$U_1, \ldots, U_q$ on the $[0, 1]$ interval. The dependence relationship is described through
their joint cdf

$$
C(u_1, \ldots, u_q) = \Pr[U_1 \le u_1, \ldots, U_q \le u_q], \tag{19.20}
$$

where the function $C(\cdot)$ is the copula, and $u_j$ is a particular realization of $U_j$,
$j = 1, \ldots, q$.

The right-hand side is the joint cdf, $F(\cdot)$, and the $q$ arguments of the copula can be
replaced by $q$ marginal cdfs $F_1(\cdot), \ldots, F_q(\cdot)$. That is,

$$
C(F_1(u_1), \ldots, F_q(u_q)) = F(u_1, \ldots, u_q)
$$

defines a joint cdf. With a copula-based construction of a joint cdf we select a set of
marginals and combine them to generate a joint cdf. A given copula is a functional
form for combining selected marginals and different choices of $C(\cdot)$ lead to different
joint cdfs. Sklar’s Theorem established that any multivariate distribution function
can be written in the form (19.20) and that given continuous marginals the copula
representation is unique.

As specialized to a multivariate survival function, Sklar’s Theorem says that a
$q$-dimensional multivariate survival function $S(t_1, \ldots, t_q)$ has a corresponding copula
representation $C(S_1(t_1), \ldots, S_q(t_q))$.

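Consider the case $q = 2$. Then,

$$
\begin{aligned}
F(t_1, t_2) &= \Pr[T_1 \le t_1, T_2 \le t_2] \\
&= 1 - \Pr[T_1 > t_1] - \Pr[T_2 > t_2] + \Pr[T_1 > t_1, T_2 > t_2]
\end{aligned}
$$

and

$$
\begin{aligned}
S(t_1, t_2) &= \Pr[T_1 > t_1, T_2 > t_2] \\
&= 1 - F_1(t_1) - F_2(t_2) + F(t_1, t_2) \\
&= S_1(t_1) + S_2(t_2) - 1 + C(1 - S_1(t_1), 1 - S_2(t_2)),
\end{aligned}
$$

where the last expression, viewed as a function of $S_1(t_1)$ and $S_2(t_2)$, is called the
survival copula. Notice that $S(t_1, t_2)$ is now a function of
the marginal survival functions only.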
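
The relationship between the joint cdf, the joint survivor function, and the copula can be checked numerically. The sketch below is only illustrative: it assumes exponential marginals and the FGM copula of Table 19.1, and the function names and parameter values are hypothetical. It evaluates $S(t_1, t_2) = S_1 + S_2 - 1 + C(1 - S_1, 1 - S_2)$, confirms that independence ($\theta = 0$) gives $S_1 S_2$, and shows that $S(t_1, t_2)$ differs from $1 - F(t_1, t_2)$.

```python
import math


def fgm_copula(u, v, theta):
    """Farlie-Gumbel-Morgenstern copula, theta in [-1, 1]."""
    return u * v * (1.0 + theta * (1.0 - u) * (1.0 - v))


def joint_cdf(t1, t2, rate1, rate2, theta):
    """F(t1, t2) = C(F1(t1), F2(t2)) with exponential marginal cdfs (an assumption)."""
    f1 = 1.0 - math.exp(-rate1 * t1)
    f2 = 1.0 - math.exp(-rate2 * t2)
    return fgm_copula(f1, f2, theta)


def joint_survivor(t1, t2, rate1, rate2, theta):
    """S(t1, t2) = S1 + S2 - 1 + C(1 - S1, 1 - S2)."""
    s1 = math.exp(-rate1 * t1)
    s2 = math.exp(-rate2 * t2)
    return s1 + s2 - 1.0 + fgm_copula(1.0 - s1, 1.0 - s2, theta)


s_indep = joint_survivor(1.0, 2.0, 0.5, 0.3, theta=0.0)
s_dep = joint_survivor(1.0, 2.0, 0.5, 0.3, theta=0.8)
one_minus_f = 1.0 - joint_cdf(1.0, 2.0, 0.5, 0.3, theta=0.8)
print(s_indep, s_dep, one_minus_f)
```

Here `one_minus_f` is the probability that at least one duration exceeds its threshold, so it is never smaller than the joint survivor probability `s_dep`.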

Copulas have a certain symmetry property that allows one to work with copulas
or survival copulas (Nelsen, 1999). Joe (1997) defines a bivariate copula associated
with $F(\cdot)$, denoted by $C(u, v)$, as a two-dimensional probability distribution function
defined on the unit square $[0, 1]^2$, with univariate marginals uniform on $[0, 1]$. For all
$(u, v) \in [0, 1]^2$, $C(u, 0) = C(0, v) = 0$, $C(u, 1) = u$, and $C(1, v) = v$. In the context
of survival copulas we replace $u$ by the marginal survivor function $S_1(t_1)$ and $v$ by the
second marginal survivor function $S_2(t_2)$. In this notation Sklar’s Theorem states that
there exists a copula function $C$ such that

$$
F(u, v) = C(F_u(u), F_v(v)), \tag{19.21}
$$

where $F(u, v) = \Pr[U \le u, V \le v]$ is a bivariate distribution function of random
variables $U$ and $V$, and $F_u(u)$ and $F_v(v)$ denote the marginal distribution functions.
If $F$ is continuous, and if the univariate marginals have corresponding quantile
functions $F_u^{-1}$ and $F_v^{-1}$, then the unique copula in Equation (19.21) can be expressed as

$$
C(u, v) = F(F_u^{-1}(u), F_v^{-1}(v)).
$$

The copula approach involves specifying marginal distributions of each random
variable along with a function (copula) that binds them together. The copula function
can be parameterized to include measures of dependence between the marginal
distributions. If no dependence is detected, the two marginals are independent, and
estimation can be performed on each variable separately. However, if dependence is present,
improved estimates may be obtained by recovering a joint distribution by way of a 
copula function. Since a copula can capture dependence structures regardless of the 
form of the margins, a copula approach to modeling related variables is potentially 
very useful to econometricians. Frechet bounds make it possible to study the extent 
of dependence permitted by any copula. Despite apparent differences we see that the 
mixture approach of Section 19.3.2 for deriving the bivariate survival function leading 
to (19.19) is fundamentally similar to that based on the copula approach as both begin 
with marginals. 

We now consider an example with $q$ durations $(T_1, \ldots, T_q)$ that are conditionally
independent given common neglected unobserved heterogeneity $v$; covariates are
excluded for simplicity. Then the conditional joint survivor function is

$$
\begin{aligned}
\Pr[T_1 > t_1, \ldots, T_q > t_q \mid v] &= \Pr[T_1 > t_1 \mid v] \times \cdots \times \Pr[T_q > t_q \mid v] \\
&= S_1(t_1 \mid v) \cdots S_q(t_q \mid v)
\end{aligned}
$$

and the multivariate survival function is defined as

$$
\Pr[T_1 > t_1, \ldots, T_q > t_q] = \mathrm{E}\left[S_1(t_1 \mid v) \cdots S_q(t_q \mid v)\right]. \tag{19.22}
$$


Measuring Dependence 


The functional form of a copula does not itself depend on the form of the univariate
margins. Copulas are usually specified with parameters that generate a measure of the 
dependence between the univariate margins. Usually dependence is parameterized as 
a scalar measure. Here we concentrate on bivariate copulas for simplicity. 

The copula representation for discrete random variables is not necessarily unique 
(Joe, 1997, p. 14). This is not a major problem in practical application where the con- 
cern is to approximate the unknown joint distribution. The key modeling issue is to 
choose a sufficiently flexible parametric form for the copula function. 

The dependence parameters from copulas can be difficult to interpret because they
are not necessarily in the $[0, 1]$ interval. Therefore, it is customary to convert the
dependence parameter to a familiar measure of association such as Kendall’s tau or
Spearman’s rho; see Joe (1997). Schweizer and Wolff (1981) showed that Spearman’s
correlation coefficient can be expressed solely in terms of the copula function;
thus,

$$
\rho(t_1, t_2) = 12 \int_0^1 \!\! \int_0^1 \{C(u, v) - uv\}\, du\, dv.
$$
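
The copula-based expression for Spearman's rho, in its standard form $\rho = 12 \int_0^1 \int_0^1 \{C(u, v) - uv\}\, du\, dv$, is easy to evaluate numerically. The sketch below applies a midpoint rule to the FGM copula, for which the integral works out to $\theta/3$; the grid size, the value of $\theta$, and the function names are arbitrary choices made for illustration.

```python
def fgm_copula(u, v, theta):
    """Farlie-Gumbel-Morgenstern copula, theta in [-1, 1]."""
    return u * v * (1.0 + theta * (1.0 - u) * (1.0 - v))


def spearman_rho(copula, n=400):
    """12 times the double integral of C(u, v) - u*v over the unit square,
    approximated with the midpoint rule on an n-by-n grid."""
    step = 1.0 / n
    total = 0.0
    for i in range(n):
        u = (i + 0.5) * step
        for j in range(n):
            v = (j + 0.5) * step
            total += copula(u, v) - u * v
    return 12.0 * total * step * step


theta = 0.6  # hypothetical dependence parameter
rho = spearman_rho(lambda u, v: fgm_copula(u, v, theta))
print(rho, theta / 3.0)
```

Because $\theta$ is restricted to $[-1, 1]$ for this copula, the result also shows that the FGM form can only accommodate Spearman correlations between $-1/3$ and $1/3$.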


Table 19.1. Some Standard Copula Functions

Copula Type   Function $C(u, v)$                                                            $\theta$-Domain
Product       $uv$                                                                          n.a.(a)
FGM(b)        $uv(1 + \theta(1 - u)(1 - v))$                                                $-1 \le \theta \le +1$
Normal(c)     $\Phi[\Phi^{-1}(u), \Phi^{-1}(v); \theta]$                                    $-1 < \theta < +1$
Clayton       $(u^{-\theta} + v^{-\theta} - 1)^{-1/\theta}$                                 $\theta \in (0, \infty)$
Frank         $-\theta^{-1} \ln\{[\eta - (1 - e^{-\theta u})(1 - e^{-\theta v})]/\eta\}$,   $\theta \in (-\infty, \infty)$
              $\eta = 1 - e^{-\theta}$

(a) n.a., not applicable.
(b) Farlie-Gumbel-Morgenstern copula.
(c) $\Phi[\cdot, \cdot; \theta]$ denotes the bivariate normal cdf.

Consider any bivariate joint cdf $F(t_1, t_2)$ with univariate marginal cdfs $F_1(t_1)$ and
$F_2(t_2)$. By definition, $0 \le F_1(t_1), F_2(t_2) \le 1$, because each marginal distribution takes
a value in the range $[0, 1]$. The joint cdf is bounded below and above by the Frechet
lower and upper bounds, $F^-$ and $F^+$, defined as

$$
\begin{aligned}
F(t_1, t_2) &\ge F^-(t_1, t_2) = \max[F_1(t_1) + F_2(t_2) - 1,\ 0], \\
F(t_1, t_2) &\le F^+(t_1, t_2) = \min[F_1(t_1),\ F_2(t_2)].
\end{aligned}
$$


Since copulas are joint cdfs, they are also subject to the Frechet bounds. Knowledge
of Frechet bounds is important in selecting an appropriate copula. Every copula places
bounds on permissible values for its dependence parameter $\theta$. A desirable feature of
a bivariate copula is that as $\theta$ approaches the lower (upper) bound of its permissible
range, the copula approaches the Frechet lower (upper) bound. However, the parametric
form of a copula may impose restrictions such that one or both Frechet bounds are
not included in the permissible range. Therefore, a particular copula may be a better
choice for one data set than for another.


Examples 


Table 19.1 gives examples of some bivariate copula functions that have been used in 
the literature. Joe (1997) discusses the properties of these copulas. 
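
As an illustration, the closed-form entries of Table 19.1 can be coded directly and compared with the Frechet bounds. The sketch below uses the standard expressions for the Product, FGM, Clayton, and Frank copulas (the Normal copula is omitted because it requires a bivariate normal cdf routine); the grid and the dependence parameter values are arbitrary choices for illustration.

```python
import math


def product_copula(u, v):
    return u * v


def fgm_copula(u, v, theta):
    # Requires -1 <= theta <= 1.
    return u * v * (1.0 + theta * (1.0 - u) * (1.0 - v))


def clayton_copula(u, v, theta):
    # Requires theta > 0.
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)


def frank_copula(u, v, theta):
    # Requires theta != 0.
    eta = 1.0 - math.exp(-theta)
    num = eta - (1.0 - math.exp(-theta * u)) * (1.0 - math.exp(-theta * v))
    return -math.log(num / eta) / theta


def frechet_lower(u, v):
    return max(u + v - 1.0, 0.0)


def frechet_upper(u, v):
    return min(u, v)


grid = [k / 10.0 for k in range(1, 10)]
copulas = {
    "product": product_copula,
    "fgm": lambda u, v: fgm_copula(u, v, -0.9),
    "clayton": lambda u, v: clayton_copula(u, v, 2.0),
    "frank": lambda u, v: frank_copula(u, v, -5.0),
}
# Every copula value should lie between the Frechet bounds (up to rounding).
within_bounds = all(
    frechet_lower(u, v) - 1e-12 <= c(u, v) <= frechet_upper(u, v) + 1e-12
    for c in copulas.values() for u in grid for v in grid
)
# Largest gap between a strongly dependent Clayton copula and the upper bound.
gap_to_upper = max(
    frechet_upper(u, v) - clayton_copula(u, v, 50.0) for u in grid for v in grid
)
print(within_bounds, gap_to_upper)
```

The small value of `gap_to_upper` reflects the Clayton copula approaching the Frechet upper bound as its dependence parameter grows, whereas its restriction to positive parameter values keeps it away from the lower bound.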

The Normal and the Frank copulas include both Frechet bounds in their permissible
ranges. The Clayton copula belongs to the Archimedean family, with the representation
$C(u, v) = \phi(\phi^{-1}(1 - u) + \phi^{-1}(1 - v))$; see Smith (2003).

Suppose we want to choose the Clayton copula to model the bivariate survival times
$(t_1, t_2)$. Then the bivariate distribution, expressed in terms of marginal survival models
$S_1(t_1)$ and $S_2(t_2)$, will be

$$
S(t_1, t_2) = \left(S_1(t_1)^{-\theta} + S_2(t_2)^{-\theta} - 1\right)^{-1/\theta}.
$$

We assume that the marginal survival functions are specified up to unknown parame- 
ters. As before these marginal survival functions can be written to capture dependence 
on covariates and unobserved heterogeneity. For example, these could be based on the 
proportional hazards model. For estimation we can apply maximum likelihood based
on the resulting bivariate copula.

This approach is not without limitations. Two in particular are noteworthy. First,
extension to three or more dimensions is not trivial. Second, one needs not only to
choose a particular functional form for the copula but also to be aware of its potential
restrictiveness in capturing dependence for a given data set. For example, only positive
correlation may be supported.


Likelihoods Derived from Copulas 


To fit a model derived from a copula (defined in terms of the cdfs) the first step is
to select a copula and the second is to derive the likelihood (defined in terms of the
pdfs) from it. Having chosen a copula, consider the derivation of the likelihood for the
special case of a bivariate model with uncensored failure times $(t_1, t_2)$. Define
$f_j(t_j) = \partial F_j(t_j)/\partial t_j$ and $C_j(F_1, F_2) = \partial C(F_1, F_2)/\partial F_j$ for $j = 1, 2$, and define
$C_{12}(F_1, F_2) = \partial^2 C(F_1, F_2)/\partial F_1 \partial F_2$. Then the probability density

$$
f(t_1, t_2) = f_1(t_1)\, f_2(t_2)\, C_{12}(F_1(t_1), F_2(t_2)), \tag{19.23}
$$

where $f(t_1, t_2) = \partial^2 F(t_1, t_2)/\partial t_1 \partial t_2$, is used to construct the likelihood function. If
censored observations are present in the data, the likelihood must be appropriately
modified (Frees and Valdez, 1998, pp. 15-16; Georges et al., 2001).
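
The following sketch shows how a log-likelihood of the form (19.23) might be assembled for uncensored data. It assumes a Clayton copula applied to the marginal cdfs and exponential marginals; both choices, the function names, and the four duration pairs are invented for illustration only. The finite-difference calculation at the end checks that the coded copula density is indeed the cross-partial derivative $C_{12}$ of the copula.

```python
import math


def clayton_cdf(u, v, theta):
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)


def clayton_density(u, v, theta):
    """C_12(u, v): the cross-partial derivative of the Clayton copula."""
    base = u ** -theta + v ** -theta - 1.0
    return (1.0 + theta) * (u * v) ** (-theta - 1.0) * base ** (-2.0 - 1.0 / theta)


def log_likelihood(durations, rate1, rate2, theta):
    """Sum of ln f(t1, t2) as in (19.23), with exponential marginals (an assumption)."""
    total = 0.0
    for t1, t2 in durations:
        cdf1 = 1.0 - math.exp(-rate1 * t1)
        cdf2 = 1.0 - math.exp(-rate2 * t2)
        pdf1 = rate1 * math.exp(-rate1 * t1)
        pdf2 = rate2 * math.exp(-rate2 * t2)
        total += (math.log(pdf1) + math.log(pdf2)
                  + math.log(clayton_density(cdf1, cdf2, theta)))
    return total


# Made-up uncensored duration pairs, purely for illustration.
sample = [(0.8, 1.1), (2.5, 3.0), (0.3, 0.2), (1.7, 0.9)]
loglik = log_likelihood(sample, rate1=0.7, rate2=0.6, theta=1.5)

# Finite-difference check that clayton_density is the cross-partial of clayton_cdf.
h = 1e-4
u0, v0, th = 0.4, 0.7, 1.5
fd = (clayton_cdf(u0 + h, v0 + h, th) - clayton_cdf(u0 + h, v0 - h, th)
      - clayton_cdf(u0 - h, v0 + h, th) + clayton_cdf(u0 - h, v0 - h, th)) / (4.0 * h * h)
print(loglik, fd, clayton_density(u0, v0, th))
```

In practice `log_likelihood` would be passed to a numerical optimizer over the marginal parameters and $\theta$, and the contributions of censored observations would need the modifications cited above.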

Using different copulas generates nonnested models. As in other similar instances, 
penalized log-likelihood values can be used to choose among them. 


19.4. Multiple Spells 


A distinction between parallel states and recurrent states, introduced early in this chap- 
ter, is helpful. Parallel states involve parallel events such as being employed and having 
health insurance; recurrent states involve sequential events such as the first birth, the 
second birth, and so forth. The term multiple spells refers to the durations between
recurrent spells of the same event. Joint modeling of such data has similarities with joint
modeling of parallel states as both involve multivariate concepts, but there are also
important differences because sequential events may generate dynamic dependence in
hazards.

Consider some examples of recurrent events. Individuals in the labor market 
may experience a succession of transitions between employment and unemployment. 
Young workers, for example, may record a succession of spells of unemployment. 
Newman and McCulloch (1984) consider the timing of births within a hazard frame- 
work. If one wants to model the hazard rate for each birth in a series of births, con- 
sideration must be given to the correlation between interbirth durations. Trivedi and 
Alexander (1989) analyze multiple spells of youth unemployment in Australia. In the 
literature on fertility, the duration between successive births is of interest (Heckman, 
Hotz, and Walker, 1985). Mealli and Pudney (1996) analyze the positive association 
between the duration in employment and pensionable status using data from a
retirement survey in the United Kingdom. Engle and Russell (1998) study the time series
of durations between successive transactions of a particular stock traded on the stock
market. Stevens (1999) analyzes the persistence of poverty over individuals’ lifetimes
taking account of multiple spells of poverty.

The aforementioned examples have several noteworthy features. First, whether the
hazard rate of an event depends on the occurrence of a previous event is an
important modeling issue. Second, the form of dependence is of interest. The duration
of a previous spell may enter as a covariate in determining the hazard of a later event;
the occurrence of a previous event may affect the baseline hazard for a later spell; and,
finally, unobserved heterogeneity may show serial dependence. Each of these raises an
important modeling issue.

Multiple spells generate longitudinal or panel data that can potentially help to re- 
solve the important identification issue concerning the influence of dynamic depen- 
dence (“the hand of past”) relative to that of heterogeneity in the hazard function. Un- 
der some assumptions multiple observations make it easier to control for heterogeneity 
and to make inferences about dynamic dependence. 

In general, survival models with unobserved heterogeneity and dependence between 
spells can be expected to be difficult to estimate. However, multiple-spell data create 
opportunities to study issues that can be studied only if panel data are available. Oc- 
currence dependence, lagged duration dependence, and serially correlated unobserved 
heterogeneity are examples. Both lagged duration and occurrence dependence refer to 
dependence of the termination probability of the spell in progress on either the number 
or the duration of previous spells. Given such dependence, it is not appropriate to study 
spells individually, ignoring their interdependence. 

In considering the choice of a suitable econometric framework for multiple spells, 
one possibility is to model dependence using joint survival functions, as discussed 
in the preceding section. This approach takes care of the multivariate nature of the 
data. A second possibility is to use the panel data framework with the time subscript 
replaced by the spell subscript, without ignoring the possibility that calendar time still
may have relevance. Spell dependence introduces issues that will be discussed under 
the topic of dynamic panel models in Sections 22.5 and 23.6. In both these cases an 
important difference arises from the possibility of censoring because of panel attrition 
or because the most recent spell is incomplete. 


19.4.1. A Model with Two Spells 


A proportional hazards model with two spells can illustrate a number of features of 
multiple-spell models. In econometrics such models have been analyzed by Honoré 
(1993) and Horowitz and Lee (2003). 

Honoré (1993) considers a proportional hazards model of the form 


$$
\lambda_s(t \mid \mathbf{x}, v) = \lambda_{0,s}(t)\, \phi(\mathbf{x}, \boldsymbol{\beta})\, v, \quad s = 1, 2. \tag{19.24}
$$


Note that in this specification the baseline hazard is spell-specific, but the heterogeneity 
component, which enters multiplicatively (a key assumption), is not; that is, v repre- 
sents the fixed or permanent characteristics of an individual, and hence we have a fixed 
effects model. Under conditions similar to those for the mixed PH discussed in
Chapter 18, he shows that the model is identified. He also shows that neither the assumptions
about the distribution of $v$ nor the presence of the covariates is essential for
identification.

In a second model Honoré considers spell-specific multiplicative heterogeneity
components $v_1$ and $v_2$, with a joint bivariate pdf $g(v_1, v_2)$. The correlation between $v_1$
and $v_2$ could reflect serially correlated heterogeneity. This is a random effects model.
The joint survival function $S(t_1, t_2 \mid \mathbf{x})$ is derived by the bivariate mixing approach as
shown in (19.19) using the mixing distribution $g(v_1, v_2)$. If the marginal survival
functions are identified, then the joint survival function is also identified. The identification
conditions are essentially those for identifiability of the PH model.

Honoré also considers the lagged duration dependence specification of the
second-spell model under the assumption that the duration of the first spell, denoted
$t_1$, enters the hazard for the second spell multiplicatively. He provides sufficient
conditions for identifiability of the parameters in the second-spell conditional model, given
covariates and $t_1$. These conditions are not discussed here. However, under these
conditions, a multiple-spells version of the proportional hazards model has the form

$$
\begin{aligned}
\lambda_1(t_1 \mid \mathbf{x}_1, v_1) &= \lambda_{01}(t_1)\, \phi(\mathbf{x}_1, \boldsymbol{\beta}_1)\, v_1, \\
\lambda_2(t_2 \mid \tilde{\mathbf{x}}_2, v_2) &= \lambda_{02}(t_2)\, \phi(\tilde{\mathbf{x}}_2, \boldsymbol{\beta}_2)\, v_2,
\end{aligned} \tag{19.25}
$$

where $\tilde{\mathbf{x}}_2 = (\mathbf{x}_2, t_1)$ is the augmented vector of covariates. Note that there is an
endogeneity problem here if $v_1$ and $v_2$ are correlated since, in that case, $t_1$ and $v_2$ cannot be
independent.
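
A small simulation can illustrate why the lagged duration is endogenous when the heterogeneity terms are correlated. The sketch below is a stylized assumption-laden example, not the model of the text: the first-spell hazard is constant and equal to $v_1$ (no covariates), $v_1$ is gamma distributed with mean one, and the second-spell heterogeneity is either identical to $v_1$ or an independent draw. All names and parameter values are hypothetical.

```python
import math
import random


def correlation(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(n, shared, seed=12345):
    """Draw first-spell durations t1 with hazard v1 and return corr(ln t1, ln v2).
    If shared is True then v2 = v1; otherwise v2 is an independent draw."""
    rng = random.Random(seed)
    log_t1, log_v2 = [], []
    for _ in range(n):
        v1 = rng.gammavariate(2.0, 0.5)   # shape 2, scale 0.5: mean 1 (illustrative)
        v2 = v1 if shared else rng.gammavariate(2.0, 0.5)
        t1 = rng.expovariate(v1)          # exponential duration with rate v1
        log_t1.append(math.log(t1))
        log_v2.append(math.log(v2))
    return correlation(log_t1, log_v2)


corr_shared = simulate(20_000, shared=True)
corr_independent = simulate(20_000, shared=False)
print(corr_shared, corr_independent)
```

With shared heterogeneity, individuals with a high $v$ tend to have short first spells, so $t_1$ is negatively correlated with the second-spell heterogeneity; with independent heterogeneity the correlation is close to zero.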

The previous occurrence of a spell may not simply shift the hazard function in the 
succeeding spell. It may also alter the specification of the hazard by bringing in new 
covariates. For example, an unemployment spell may induce enrollment into a training 
program, which plausibly could impact the hazard of a later spell of unemployment. 
If the training variable were treated as weakly exogenous, identification of the model 
would be under threat. This point is relevant even for the analysis of a single-spell 
model: The assumption that covariates and unobserved heterogeneity are uncorrelated 
is not innocuous. 

In some cases it may be desirable to model not only multiple spells in one state but 
also those in other related states. For example, there may be two states, employed or 
not employed, and we may be interested in not just how length of last unemployment 
spell affects the length of current unemployment spell but also in the effect of the 
intervening employment spell on the hazard out of unemployment. Further, we might 
observe data on individuals when they are in one state but not another. For example, 
administrative data may cover people when on welfare but not when off welfare. 


19.4.2. A More General Model of Multiple Spells 


To illustrate the potential computational complexity of multiple-spell models, we de- 
scribe briefly the model of Mealli and Pudney (1996). 

Let $\boldsymbol{\tau} = (\tau_1, \ldots, \tau_k)$ denote the $k$-dimensional vector of complete spells, $r_{l-1}$ the
index of origin state, and $r_l$ the index of destination state. Assume independence of
durations across spells after controlling for possible lagged duration dependence. Let
$\lambda_j(\mathbf{x}_j, \boldsymbol{\beta}_j)$ denote the destination-specific hazard function, and let $\mathbf{x} = [\mathbf{x}_1, \ldots, \mathbf{x}_k]$,
$\boldsymbol{\beta} = [\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_q]$.


The joint density of spells and exit routes is given by 


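$$
\begin{aligned}
f(\tau_1, r_1, &\tau_2, r_2, \ldots, \tau_k \mid \mathbf{x}_1, \ldots, \mathbf{x}_k, r_0, \boldsymbol{\beta}) \\
&= f(\tau_1, r_1 \mid \mathbf{x}_1, r_0; \boldsymbol{\beta}) \cdots f(\tau_{k-1}, r_{k-1} \mid \mathbf{x}_{k-1}, r_0, r_1, \ldots, r_{k-2}; \boldsymbol{\beta}) \\
&\quad \times S(\tau_k \mid \mathbf{x}_k, r_0, r_1, \ldots, r_{k-1}, \boldsymbol{\beta}) \\
&= \prod_{l=1}^{k-1} \lambda_{r_l}(\tau_l \mid \mathbf{x}_l, \boldsymbol{\beta}) \exp\left(-\sum_{l=1}^{k} \Lambda(\tau_l \mid \mathbf{x}_l, \boldsymbol{\beta})\right),
\end{aligned} \tag{19.26}
$$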


where it has been assumed that the kth spell is censored (in progress) and we use 
relationships (17.4) and (17.6). The covariates include some that vary across spells 
and possibly lagged durations. This formulation may be compared with the single- 
spell CRM formulation (19.7). 
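
The structure of this joint density, a hazard and a survivor term for each completed spell and a survivor term alone for the censored final spell, can be sketched in code. The example below is a deliberately simplified assumption: destination-specific hazards are constant, do not depend on covariates, the origin state, or lagged durations, and unobserved heterogeneity is ignored. The function name, state labels, and numbers are hypothetical.

```python
import math


def log_density_spell_history(spells, final_duration, hazards):
    """Log of the joint density of completed spells and exit routes, with the
    last spell censored, assuming constant destination-specific hazards.

    spells: list of (duration, destination) pairs for the completed spells.
    final_duration: elapsed duration of the censored spell still in progress.
    hazards: dict mapping each destination label to its constant hazard rate.
    """
    total_hazard = sum(hazards.values())
    log_density = 0.0
    for duration, destination in spells:
        # Completed spell: hazard to the observed destination times the survivor
        # function, where the integrated hazard sums over all destinations.
        log_density += math.log(hazards[destination]) - total_hazard * duration
    # Censored spell: only the survivor function contributes.
    log_density += -total_hazard * final_duration
    return log_density


# Hypothetical history: two completed spells, then a spell still in progress.
rates = {"full_time": 0.20, "part_time": 0.05}
history = [(3.0, "full_time"), (1.5, "part_time")]
value = log_density_spell_history(history, final_duration=2.0, hazards=rates)
print(value)
```

Allowing the hazards to depend on covariates, previous states, lagged durations, or unobserved heterogeneity changes the form of each term but not this overall product structure.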

Mealli and Pudney (1996) build an elaborate model using this formulation as 
the basis. Because they allow for unobserved heterogeneity with even more com- 
plex structure than that considered in this chapter, their computational procedure is 
also more complicated. They use the method of simulated maximum likelihood (see 
Section 12.4). 


19.5. Competing Risks Example: Unemployment Duration 


The duration examples used in Chapters 17 and 18 focused on the time in an unem- 
ployment spell, ignoring the destination state after transition. Here we implement a 
competing risks analysis of the same data used in McCall (1996). The data distinguish
three different destination states: full-time employment in the first postdisplacement 
job, part-time employment in the first postdisplacement job, and either full-time or 
part-time status in the first postdisplacement job the employee had left by the time of 
the survey. One can therefore relax the assumption that the hazard function does not 
depend on the destination state and consider instead the competing risks formulation 
in which independent competing risks determine the duration of unemployment. 

For the McCall data set there are 1073, 339, and 574 transitions, respectively, to 
each of the three states mentioned. The third destination state lacks a clear interpre- 
tation, so the results for that case are not discussed in detail. For each transition we 
estimated four parametric duration models, exponential and Weibull, with and without 
inverse-Gaussian heterogeneity. Gamma heterogeneity was also considered but this 
model was computationally unstable. Because of the assumption of independent com- 
peting risks, estimation can be carried out one equation at a time. Selected extracts 
of the computer output, with focus only on a limited number of variables as in
Chapters 17 and 18, are given in Tables 19.2 and 19.3.

Table 19.2. Unemployment Duration: Competing and Independent Risk Estimates of
Exponential Model with and without IG Frailty

                     No Heterogeneity                 IG Heterogeneity
Coefficient     Risk 1    Risk 2    Risk 3       Risk 1    Risk 2    Risk 3
Transitions      1,073       339       574        1,073       339       574
RR                .472     −.092     −.600         .504     −.185     −.562
                (.601)    (.976)    (.725)       (.614)   (1.025)    (.744)
DR               −.575     −.959     1.122        −.806    −1.051     1.078
                (.762)   (1.247)    (.901)       (.781)   (1.295)    (.921)
UI              −1.424    −1.047     −.966       −1.544    −1.092     −.963
                (.249)    (.524)    (.449)       (.258)    (.544)    (.456)
RRUI              .966     −.669     −.432        1.057     −.742     −.482
                (.612)   (1.192)   (1.014)       (.627)    (1.23)   (1.033)
DRUI             −.198     1.987     2.102        −.012      2.18     2.158
               (1.019)   (1.727)   (1.303)      (1.041)   (1.788)   (1.323)
LNWAGE            .351     −.257      .003         .373     −.321     −.007
                (.116)    (.179)    (.145)       (.118)    (.191)    (.147)
TENURE               0      .005     −.047        .0006      .007     −.047
                (.006)    (.013)    (.012)       (.007)    (.014)    (.012)
−ln L           5,693.63                         5,687.64


19.5.1. Estimates under Competing Risks Framework 


Pairwise comparison of exponential models with and without heterogeneity shows that
the log-likelihood improves when unobserved heterogeneity is introduced.
This result is similar to the pattern reported in Section 18.8. However, the
Weibull model without heterogeneity has a significantly higher log-likelihood than the
exponential model, −5,666 against −5,693. The Weibull model with inverse-Gaussian
heterogeneity has the highest log-likelihood, −5,543, and seems to be the best of the
four models. This should not be interpreted to mean that it is a satisfactory model
for inference; that issue remains open. Henceforth we shall discuss the results in
Table 19.3.
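
The pairwise comparisons above can be restated as likelihood ratio statistics. The short sketch below simply computes twice the difference in log-likelihoods from the values reported in Table 19.2 and in the text (the Weibull figures are the rounded values quoted above). The appropriate reference distribution is left open here: it depends on how many parameters each extension adds across the three hazards, and the heterogeneity variance lies on the boundary of the parameter space under the null, so conventional chi-square critical values should be treated only as a rough guide.

```python
def likelihood_ratio(loglik_restricted, loglik_unrestricted):
    """Twice the gain in log-likelihood from relaxing the restriction."""
    return 2.0 * (loglik_unrestricted - loglik_restricted)


# Exponential without vs. with IG heterogeneity (-ln L values from Table 19.2).
lr_exp_heterogeneity = likelihood_ratio(-5693.63, -5687.64)
# Exponential vs. Weibull, both without heterogeneity (rounded values from the text).
lr_weibull = likelihood_ratio(-5693.0, -5666.0)
# Weibull without vs. with IG heterogeneity (rounded values from the text).
lr_weibull_heterogeneity = likelihood_ratio(-5666.0, -5543.0)
print(lr_exp_heterogeneity, lr_weibull, lr_weibull_heterogeneity)
```

The statistics are roughly 12, 54, and 246, respectively, so the largest gain comes from adding heterogeneity to the Weibull specification.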

Introduction of unobserved heterogeneity in the Weibull model leads to a substantial
increase in the estimate of the hazard function slope coefficient in all three hazard
functions. This coefficient increases from 1.29 to 1.75 for risk 1, and from 1.08 to 1.65 for
risk 2. That is, the introduction of unobserved heterogeneity leads to a stronger
indication of positive duration dependence, or a steeply rising hazard out of unemployment.
These changes are along the lines predicted by the analysis of Section 18.5. In the
Weibull model the impact of adding unobserved heterogeneity on the coefficient of
unemployment insurance (UI) is also quite substantial, with the coefficient becoming
substantially larger in absolute magnitude. The coefficients of RR, DR, RRUI, and DRUI remain
imprecisely determined. The coefficient of LNWAGE is significant and positive in the first
hazard function, but not in the second. That is, the increase in LNWAGE accelerates
transition out of unemployment of those seeking full-time employment but has a
negligible impact on those who transit to part-time employment. This exemplifies how the
competing risks framework may allow us to distinguish between the different roles of a
variable in different hazard functions.

Figure 19.1: Unemployment duration: estimated baseline survival functions from the Cox
Competing Risks model. U.S. data from 1986-92 on 3343 spells, some incomplete.

Also consider the Cox model specification of the competing risks model given in 
Section 19.2. In this specification unobserved heterogeneity is ignored and the base- 
line hazard is not parametrically specified, but it can be estimated as explained in 
Section 17.8.3. The point estimates, comparable to those for the exponential model 
in Table 19.2, are given in the last three columns of Table 19.3, but the standard er- 
rors are much larger, as the Cox specification is less restrictive than the exponential. 
The estimated coefficient of unemployment insurance is closer to that in the exponen- 
tial model than to that in the Weibull-IG model; the latter is almost twice as large. 
The LNWAGE coefficient is also larger in the Weibull-IG model. However, given that 
unobserved heterogeneity is ignored, identification of the baseline hazard is not possi- 
ble. Figures 19.1 and 19.2 show, respectively, the computed baseline survival functions 
and the cumulated hazard functions for the three destinations, but these are better inter- 
preted as reflecting some unknown mixture of unobserved heterogeneity and duration 
dependence. These estimates show that the baseline survival function for those transiting
to full-time employment is the lowest of the three, and that for those transiting to
part-time employment is the flattest and the highest. Correspondingly, the cumulated
hazard function for those transiting to full-time employment is the steepest of the three.

The discussion and analysis presented here are only illustrative, not final in any sense.
Indeed, there remain good reasons to suggest that the Weibull hazard function is a
misspecification. McCall's (1996) analysis of the same data set allows for a more flexible
polynomial hazard function and finds evidence supporting a bathtub-shaped hazard, which
implies a decreasing hazard at low durations, then a fairly constant hazard, and
eventually a rising hazard at high durations. The monotonic Weibull hazard function
cannot capture this possibility. The experience of other researchers modeling
unemployment duration using U.S. data has revealed that when the hazard function is
flexibly specified, the introduction of unobserved heterogeneity does not have a large
impact on the results (Meyer, 1990; Han and Hausman, 1990). The fact that we do not see
that result here should motivate the use of a more flexible specification such as the one
analyzed in Section 17.10.

Figure 19.2: Unemployment duration: estimated baseline cumulative hazards from the Cox
Competing Risks model. Same data as Figure 19.1.


19.6. Practical Considerations 


In modeling multivariate survival models it is practical to begin with marginal models 
before undertaking simultaneous estimation. Such a strategy can be helpful in assess- 
ing the statistical adequacy of the initial specification. 

At the time of this writing, the implementation of multivariate survival and hazard
models will in most cases require one's own programming. This task can be partially
eased by supporting software, such as the optimization routines for maximizing or
minimizing user-defined functions that many statistical packages and programming
platforms provide.

The CRM with independent risks reduces to estimation of a series of survival models,
for which practical advice was given in Section 17.12. Programs for general
multivariate CRMs are not easy to find in commercial software. Some multivariate
survival models with a special dependence structure are supported. For example, STATA
supports estimation of the shared frailty model. A shared frailty model is a random
effects model in which the components of unobserved heterogeneity are common to, or
shared among, groups of individuals or spells and are randomly distributed across
groups.

If the main interest is in modeling the dependence structure among durations, the
copula approach is potentially attractive relative to maximum simulated likelihood for
the bivariate case, because it does not require numerical integration. For dimensions
higher than two, as in the case of multiple-spell models, the approach is feasible, but
there are relatively few examples in the published literature. Marginal models can be
fitted and tested using standard univariate survival models, and the dependence parameter
can be estimated in a sequential second-stage procedure. Even if all parameters are
to be estimated simultaneously, the estimated marginal models provide a useful set of
starting values for the iterative computation. We are unaware of statistical software
that supports the estimation of these models.
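
The sequential procedure just described can be sketched in a few lines of code. The
sketch below is only illustrative and rests on assumptions that are not part of the
discussion above: the two margins are exponential, there is no censoring, and dependence
is generated by a Clayton copula with parameter theta. The function names
(simulate_pairs, two_stage_fit, and so on) are hypothetical. Stage 1 fits each margin by
ML; stage 2 maximizes the copula log-likelihood over theta alone, here by a simple
golden-section search.

```python
import math
import random


def clip_unit(p, eps=1e-12):
    """Keep a probability strictly inside (0, 1) so logs and powers stay finite."""
    return min(max(p, eps), 1.0 - eps)


def simulate_pairs(n, theta, rate1, rate2, rng):
    """Simulate n duration pairs: exponential margins linked by a Clayton copula."""
    pairs = []
    for _ in range(n):
        u = clip_unit(rng.random())
        w = clip_unit(rng.random())
        # Conditional inversion: solve dC(u, v)/du = w for v.
        v = (u ** (-theta) * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
        v = clip_unit(v)
        # Invert the exponential cdf F(t) = 1 - exp(-rate * t).
        pairs.append((-math.log(1.0 - u) / rate1, -math.log(1.0 - v) / rate2))
    return pairs


def clayton_loglik(theta, us, vs):
    """Log-likelihood of the Clayton copula density at dependence parameter theta."""
    total = 0.0
    for u, v in zip(us, vs):
        s = u ** (-theta) + v ** (-theta) - 1.0
        total += (math.log1p(theta)
                  - (theta + 1.0) * (math.log(u) + math.log(v))
                  - (2.0 + 1.0 / theta) * math.log(s))
    return total


def golden_section_max(f, lo, hi, tol=1e-4):
    """Maximize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:                      # maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:                            # maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return 0.5 * (a + b)


def two_stage_fit(pairs, theta_lo=0.05, theta_hi=10.0):
    """Stage 1: exponential margins by ML. Stage 2: copula parameter by ML."""
    n = len(pairs)
    rate1 = n / sum(t1 for t1, _ in pairs)
    rate2 = n / sum(t2 for _, t2 in pairs)
    # Transform durations to uniforms using the fitted marginal cdfs.
    us = [clip_unit(-math.expm1(-rate1 * t1)) for t1, _ in pairs]
    vs = [clip_unit(-math.expm1(-rate2 * t2)) for _, t2 in pairs]
    theta = golden_section_max(lambda th: clayton_loglik(th, us, vs), theta_lo, theta_hi)
    return rate1, rate2, theta


if __name__ == "__main__":
    rng = random.Random(12345)
    data = simulate_pairs(1500, theta=2.0, rate1=0.5, rate2=1.0, rng=rng)
    print(two_stage_fit(data))
```

The stage 1 estimates could equally serve as starting values for joint estimation of all
parameters. Standard errors from the second stage would need to account for the
estimated margins, which this sketch does not attempt.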




19.7. Bibliographic Notes 


Han and Hausman (1990) give an empirical example of CRM in which the specification is 
generalized to allow for unobserved heterogeneity. Within the framework of the CRM with 
state-specific random effects, McCall (1996) analyzes the impact of some policy variables 
on the behavior of the insured unemployed seeking part-time work using the CRM
with correlated risks. In Butler, Anderson, and Burkhauser (1989) the hazards of accepting
a job and of dying are modeled using a CRM with correlated risks. 

Sklar’s pioneering article on copulas appeared in 1959 in French, but Sklar (1973) is a 
good substitute in English. Radulović and Wegkamp (undated) provide a proof of Sklar’s 
Theorem. A very helpful guided tour of the copula literature with an annotated bibliogra- 
phy is given by Frees and Valdez (1998). 

Multiple spells are studied by Mealli and Pudney (1996) and by Flinn and Heckman 
(1982). Mealli and Pudney (1996) analyze transitions among pensionable jobs, nonpen- 
sionable jobs, and other labor market states using simulation-based estimation methods. 


Exercises 


19-1 (Adapted from Sapra, 2000; 2001). This problem involves an example that illustrates
the Cox–Tsiatis nonidentification result for competing risks mentioned in Section 19.2.
Consider the following dependent competing risks model in which we observe
$T = \min(T_1, T_2)$ and $\delta$, where $\delta = 1$ if $T = T_1$, and $\delta = 2$ if
$T = T_2$. Here $T_1$ and $T_2$ are latent durations of risks 1 and 2, respectively.
Suppose that the bivariate joint survivor function is
$S(t_1, t_2) = \exp[-(\lambda_1 t_1 + \lambda_2 t_2)^{\alpha}]$, $0 < \alpha < 1$,
$\lambda_1, \lambda_2 > 0$. Construct an independent CRM that is equivalent to the
specified dependent competing risks model.


19-2 For the model specified in the preceding problem, write down the log-likelihood
function for each model in terms of hazard rates and integrated hazard rates, if
both $T$ and $\delta$ are observed. Examine the information matrix of the parameters,
and show that all the parameters are locally identified because it is nonsingular.


19-3 Consider two parallel durations, say the duration of unemployment, $T_1$, and the
duration of a spell without private health insurance, $T_2$. Assume that conditional
on unobserved heterogeneity the durations are independent and exponentially
distributed with means $\beta_0 + \beta_1 x$ and $\gamma_0 + \gamma_1 x$, respectively.
Suppose that multiplicative unobserved heterogeneity terms for the two duration models
are $\nu_1$ and $\nu_2$, with $\mathrm{E}[\nu_1] = \mathrm{E}[\nu_2] = 1$.


(a) For parameter values of your choice, write an algorithm to generate correlated
realizations for $(\nu_1, \nu_2)$ such that unconditionally on $(\nu_1, \nu_2)$, but
conditionally on $x$, the two durations will be correlated. You are free to
make distributional assumptions for the joint distribution of $(\nu_1, \nu_2)$ that are
appealing on grounds of mathematical convenience or other considerations.
Explain how you can control the extent of correlation between the
two durations.

(b) Using the technique for obtaining a bivariate joint distribution given in 
Section 19.3.2, derive the joint distribution of durations. 

(c) Describe how you might extend the analysis of part (b) to allow for the pres- 
ence of right-censored durations. 


19-4 Using the same subsample of the McCall data set as in Chapter 18, estimate
a two-state model with unemployment and employment as the two states
(i.e., ignoring the distinction between part-time and full-time employment as two
alternative destinations).

(a) Fit the single-equation Weibull model and compare the results with those for 
independent CRM with the Weibull specification. 

(b) Evaluate the improvement in goodness of fit resulting from the CRM speci- 
fication. 

(c) Evaluate and compare the fitted values of the hazard out of unemployment,
evaluated at sample averages of the explanatory variables, from the single-equation
and the CRM models.


CHAPTER 20


Models of Count Data 


20.1. Introduction 


In many economic contexts the dependent or response variable of interest is a non- 
negative integer or count that we wish to explain or analyze in terms of a set of re- 
gressors. Unlike the classical regression model, the response variable is discrete, with 
a distribution that places probability mass at nonnegative integer values only. Several 
models discussed earlier in the book, such as the binary outcome model and the du- 
ration model, can be shown to be closely related to the count data regression model. 
Regression models for counts, like other limited or discrete dependent variable models 
such as the logit and probit, are nonlinear with many properties and special features 
intimately connected to discreteness and nonlinearity. 

Let us consider some examples from microeconometrics, beginning with sample 
data that are independent cross-section observations. Fertility studies often model the 
number of live births over a specified age interval of the mother, with interest in an- 
alyzing its variation in terms of, say, mother’s schooling, age, and household income 
(Winkelmann, 1995). In some models of family decisions the number of children may 
appear as an explanatory variable with the acknowledgment that the variable is en- 
dogenous. Accident analysis studies model airline safety as measured by the number 
of accidents experienced by an airline over some period and seek to determine its rela- 
tionship to airline profitability and other measures of the financial health of the airline 
(Rose, 1990). Recreational demand studies seek to place a value on natural resources 
such as national forests by modeling the number of trips to a recreational site (Gurmu 
and Trivedi, 1996). Health demand studies model data on the number of times that 
individuals consume a health service, such as visits to a doctor or days in the hospital 
in the past year (Cameron et al., 1988). If we wish to analyze the relation between this 
variable and factors such as health status and health insurance, again a count regression 
is relevant. 

The main modeling approaches are presented in Sections 20.2–20.5. Section 20.2
details the Poisson regression model. Section 20.3 gives an application to data from the
famous RAND Health Insurance Experiment (RHIE). The Poisson regression model is often
too restrictive, and other, more commonly used, fully parametric count models are
presented in Section 20.4. Less-used alternative parametric approaches for counts, such
as discrete choice models, are also presented in this section. The partially parametric
approach of modeling the conditional mean and conditional variance is detailed in
Section 20.5. Multivariate count models and models with endogenous regressors are given
an introductory treatment in Section 20.6. Section 20.7 illustrates various models by
application to the RHIE data. This is followed by a discussion of some practical issues.
For pedagogical reasons the Poisson regression model for cross-section data is presented
in some detail. The other models, many superior to Poisson, are presented in less detail
for space reasons. For more complete treatment see Cameron and Trivedi (1998) and the
Bibliographic Notes.

Table 20.1. Proportion of Zero Counts in Selected Empirical Studies

Study                          Variable                   Sample Size   Proportion of Zeros
Cameron et al. (1988)          Doctor visits                    5,190                 0.798
Pohlmeier and Ulrich (1995)    Specialist visits                5,096                 0.678
Grootendorst (1995)            Prescription drugs               5,743                 0.224
Deb and Trivedi (1997)         Number of hospital stays         4,406                 0.806
Gurmu and Trivedi (1996)       Recreational trips                 659                 0.632
Geil et al. (1997)             Hospitalizations                30,590                 0.899
Greene (1997)                  Major derogatory reports         1,319                 0.803


20.2. Basic Count Data Regression 


In some cases, such as number of births, the count is the variable of ultimate inter- 
est. In other cases, such as medical demand and results of research and development 
expenditure, the variable of ultimate interest is continuous, often expenditures or re- 
ceipts measured in dollars, but the best data available are instead a count. In many 
cases, the sample is concentrated on a few small discrete values, say 0, 1, and 2. 
Table 20.1 illustrates this point by reference to the proportion of zero counts observed 
in several published econometric models; this proportion can be as high as 90% in 
some cases. Also, the data can be skewed to the right. Finally, the data are intrinsi- 
cally heteroskedastic with variance increasing with the mean. 


20.2.1. Poisson Regression 


The Poisson is the starting point for count data analysis, though it is often inadequate.
In Sections 20.2.1–20.2.3 we present the Poisson regression model, which was previously
introduced in Section 5.2, estimation by maximum likelihood, interpretation of the
estimated coefficients, and extensions to truncated and censored data. In
Section 20.2.2 we also present the quasi-MLE based on the Poisson distribution with
correctly specified conditional mean, but with possibly misspecified conditional
variance function. Limitations of the Poisson model, notably its property of
equidispersion, are presented in Section 20.2.4.

Table 20.2. Summary of Data Sets Used in Recent Patent-R&D Studies

Study                        Sample Size    Mean    Std. Error   Maximum Patents   Proportion of Zeros
Cincera (1997)                       181    60.8        721.6                925                 <0.19
Crepon and Duguet (1997b)            698    11.6         na^a                 na                 0.441
Crepon and Duguet (1997a)            451    2.73        11.45                 na                 0.729
Hausman et al. (1984)                346    32.1        66.36                515                 0.220
Wang et al. (1998)                    70   23.46        39.10                173                 0.186

^a na, not available.

There is a qualification: In some cases a high proportion of zeros in the sample 
may coexist with very large values of counts, creating a difficult modeling challenge. 
Table 20.2 illustrates this feature using information from five studies that have inves- 
tigated the relationship between patent counts and research and development (R&D) 
expenditure. Observe how large the maximum observed value of the count is relative 
to the sample mean. The modeling challenge is to select a functional form that can 
adequately capture the large mean and the high proportion of zeros. In many other 
examples, such as number of births, virtually all the data are restricted to single digits, 
and the mean number of events is quite low. 

These features motivate the application of special methods and models for count 
regression. There are two ways to proceed. 

The first approach is a fully parametric one that completely specifies the distribu- 
tion of the data, fully respecting the restriction of y to nonnegative integer values. This 
approach was taken in early applications, mostly in biostatistics, where count regres- 
sion was seen as an extension and generalization of a vast literature on the distribution 
of independent and identically distributed counts. It was also taken in the influential 
econometrics study by Hausman et al. (1984). 

The second approach is a mean-variance approach, which specifies the condi- 
tional mean to be nonnegative and specifies the conditional variance to be a function 
of the conditional mean. This models well the nonnegativity and heteroskedasticity 
but does not address the discreteness of the data. This approach, in a framework not 
limited to only count data, was introduced by Nelder and Wedderburn (1972), lead- 
ing to the generalized linear model approach widely used in statistics (McCullagh and 
Nelder, 1989). In econometrics this approach was introduced by Gouriéroux, Monfort,
and Trognon (1984a,b) and is best viewed as a specialization of the generalized method
of moments.


20.2.2. Poisson MLE and QMLE 


The Poisson MLE and quasi-MLE (QMLE) were introduced and studied in Chapter 5
as an example of m-estimation. Here we give a more complete treatment.

The natural stochastic model for counts is a Poisson point process for the occurrence
of the event of interest. This implies a Poisson distribution for the number of
occurrences of the event, with density, or more formally probability mass function,

$$\Pr[Y = y] = \frac{e^{-\mu}\mu^{y}}{y!}, \qquad y = 0, 1, 2, \ldots, \tag{20.1}$$

where $\mu$ is the intensity or rate parameter. We refer to the distribution as
P[$\mu$]. The first two moments are

$$\mathrm{E}[Y] = \mu, \qquad \mathrm{V}[Y] = \mu. \tag{20.2}$$

This shows the well-known equidispersion (equality of mean and variance) property
of the Poisson distribution.
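
As a quick numerical check of (20.1) and (20.2), the following sketch evaluates the
probability mass function and approximates the first two moments by a truncated sum.
The function names and the truncation point y_max are our own illustrative choices.

```python
import math


def poisson_pmf(y, mu):
    """Probability mass function (20.1), evaluated on the log scale for stability."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))


def poisson_moments(mu, y_max=200):
    """Approximate E[Y] and V[Y] by summing over y = 0, ..., y_max."""
    probs = [poisson_pmf(y, mu) for y in range(y_max + 1)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var


if __name__ == "__main__":
    # Mean and variance should both be very close to mu (equidispersion).
    print(poisson_moments(2.5))
```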

By introducing the observation subscript $i$, attached to both $y$ and $\mu$, the iid
framework is extended to the regression case. The Poisson regression model is derived from
the Poisson distribution by parameterizing the relation between the mean parameter $\mu$
and covariates (regressors) $\mathbf{x}$. The standard assumption is to use the
exponential mean parameterization,

$$\mu_i = \exp(\mathbf{x}_i'\boldsymbol{\beta}), \qquad i = 1, \ldots, N, \tag{20.3}$$

where by assumption there are $K$ linearly independent covariates, usually including a
constant. Because $\mathrm{V}[y_i|\mathbf{x}_i] = \exp(\mathbf{x}_i'\boldsymbol{\beta})$,
by (20.2) and (20.3), the Poisson regression is intrinsically heteroskedastic.

Given (20.1) and (20.3) and the assumption that the observations $(y_i|\mathbf{x}_i)$ are
independent, the most natural estimator is maximum likelihood. The log-likelihood
function is

$$\ln L(\boldsymbol{\beta}) = \sum_{i=1}^{N} \left\{ y_i \mathbf{x}_i'\boldsymbol{\beta} - \exp(\mathbf{x}_i'\boldsymbol{\beta}) - \ln y_i! \right\}. \tag{20.4}$$

The Poisson MLE, denoted $\hat{\boldsymbol{\beta}}_{P}$, is the solution to $K$ nonlinear
equations corresponding to the first-order condition for maximum likelihood,

$$\sum_{i=1}^{N} \left( y_i - \exp(\mathbf{x}_i'\boldsymbol{\beta}) \right) \mathbf{x}_i = \mathbf{0}. \tag{20.5}$$

If $\mathbf{x}_i$ includes a constant term then the residuals
$y_i - \exp(\mathbf{x}_i'\hat{\boldsymbol{\beta}})$ sum to zero by (20.5).
The log-likelihood function is globally concave; hence solving these equations by
a Gauss–Newton or Newton–Raphson iterative algorithm yields unique parameter
estimates.
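
The Newton–Raphson iterations for (20.5) can be sketched as follows for the simplest
nontrivial case of a constant and a single regressor, so that the 2 x 2 linear system can
be solved by hand. This is a minimal sketch under our own assumptions: the simulated
design, the starting values, and the names poisson_draw, simulate, and fit_poisson are
all hypothetical, and Knuth's multiplication method is used to draw Poisson variates
only because it needs nothing beyond the standard library.

```python
import math
import random


def poisson_draw(mu, rng):
    """Draw a Poisson(mu) variate by Knuth's multiplication method (small mu only)."""
    limit, k, prod = math.exp(-mu), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate(n, b0, b1, rng):
    """Simulate y_i ~ Poisson(exp(b0 + b1 * x_i)) with x_i uniform on (-1, 1)."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [poisson_draw(math.exp(b0 + b1 * x), rng) for x in xs]
    return xs, ys


def fit_poisson(xs, ys, tol=1e-10, max_iter=50):
    """Newton-Raphson for the first-order conditions (20.5), K = 2."""
    # Start from the intercept-only MLE, log(ybar), with zero slope.
    b0, b1 = math.log(sum(ys) / len(ys)), 0.0
    for _ in range(max_iter):
        mus = [math.exp(b0 + b1 * x) for x in xs]
        # Score vector: sum of (y_i - mu_i) * x_i, with x_i = (1, x).
        g0 = sum(y - m for y, m in zip(ys, mus))
        g1 = sum((y - m) * x for x, y, m in zip(xs, ys, mus))
        # Negative Hessian: sum of mu_i * x_i x_i'.
        h00 = sum(mus)
        h01 = sum(m * x for x, m in zip(xs, mus))
        h11 = sum(m * x * x for x, m in zip(xs, mus))
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


if __name__ == "__main__":
    rng = random.Random(0)
    xs, ys = simulate(2000, 0.5, 0.8, rng)
    b0, b1 = fit_poisson(xs, ys)
    # With a constant included, the residuals sum to zero at the MLE.
    print(b0, b1, sum(y - math.exp(b0 + b1 * x) for x, y in zip(xs, ys)))
```

Because the log-likelihood is globally concave, the iterations typically converge in a
handful of steps from this starting point.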

In the econometrics literature pseudo-ML (PML) or quasi-ML (QML) estimation
refers to estimating by ML under misspecification of the specified density (Gouriéroux
et al., 1984a). The terms PML and QML are often used interchangeably. The distribution
of the estimator is obtained under weaker assumptions about the data-generating
process than those that led to the specified likelihood function; see Section 5.7. In the
statistics literature QML often refers to nonlinear generalized least-squares estimation.
For the Poisson regression, QML in the latter sense is equivalent to standard maximum
likelihood.

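From (20.5), the Poisson PML estimator, $\hat{\boldsymbol{\beta}}_{P}$, has first-order
conditions $\sum_{i=1}^{N} (y_i - \exp(\mathbf{x}_i'\boldsymbol{\beta}))\mathbf{x}_i = \mathbf{0}$.
As already noted, the summation on the left-hand side has expectation zero if
$\mathrm{E}[y_i|\mathbf{x}_i] = \exp(\mathbf{x}_i'\boldsymbol{\beta})$. Hence the Poisson
PML is consistent under the weaker assumption of correct specification of the conditional
mean; that is, the data need not be Poisson distributed. Using results given in
Section 5.2.3, the variance matrix is of the sandwich form, with

$$\mathrm{V}_{PML}[\hat{\boldsymbol{\beta}}_{P}] = \left( \sum_{i=1}^{N} \mu_i \mathbf{x}_i \mathbf{x}_i' \right)^{-1} \left( \sum_{i=1}^{N} \omega_i \mathbf{x}_i \mathbf{x}_i' \right) \left( \sum_{i=1}^{N} \mu_i \mathbf{x}_i \mathbf{x}_i' \right)^{-1} \tag{20.6}$$

and $\omega_i = \mathrm{V}[y_i|\mathbf{x}_i]$ is the conditional variance of $y_i$.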

By standard ML theory, if the stronger assumption is made that the Poisson regression
is parametrically correctly specified, so that $\omega_i = \mu_i$, the estimator
$\hat{\boldsymbol{\beta}}_{P}$ is consistent for $\boldsymbol{\beta}$ and asymptotically
normal with variance matrix

$$\mathrm{V}[\hat{\boldsymbol{\beta}}_{P}] = \left( \sum_{i=1}^{N} \mu_i \mathbf{x}_i \mathbf{x}_i' \right)^{-1} \tag{20.7}$$

in the case where $\mu_i$ is of the exponential form (20.3).

The Poisson ML and PML estimators are identical but have different variances. The 
empirical implementation of the more robust estimate (20.6) is presented in Section 
20.5.1. 
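
To see how far (20.6) and (20.7) can differ, consider the following sketch for the case
of a single scalar regressor, in which the matrix expressions reduce to ratios of sums.
Two choices here are our own assumptions rather than part of the preceding text: the
unknown $\omega_i$ is replaced by the squared residual $(y_i - \hat{\mu}_i)^2$, a common
choice, and the overdispersed data are simulated from a Poisson-gamma mixture. In the
intercept-only case ($x_i = 1$, $\hat{\mu}_i = \bar{y}$) the ratio of the robust to the
ML variance equals the sample variance-to-mean ratio. All function names are hypothetical.

```python
import math
import random


def poisson_draw(mu, rng):
    """Draw a Poisson(mu) variate by Knuth's multiplication method (small mu only)."""
    limit, k, prod = math.exp(-mu), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_overdispersed(n, mu, alpha, rng):
    """Poisson-gamma mixture with mean mu and variance mu + alpha * mu**2."""
    # gammavariate(shape, scale) with shape 1/alpha and scale alpha has mean 1.
    return [poisson_draw(mu * rng.gammavariate(1.0 / alpha, alpha), rng)
            for _ in range(n)]


def poisson_variances(xs, ys, mus):
    """Scalar-regressor versions of the ML variance (20.7) and the sandwich (20.6)."""
    a = sum(m * x * x for x, m in zip(xs, mus))                    # sum mu_i x_i^2
    b = sum((y - m) ** 2 * x * x for x, y, m in zip(xs, ys, mus))  # sum omega_i x_i^2
    return 1.0 / a, b / (a * a)


if __name__ == "__main__":
    rng = random.Random(1)
    ys = simulate_overdispersed(2000, 3.0, 1.0, rng)
    ybar = sum(ys) / len(ys)
    # Intercept-only Poisson regression: the fitted mean is ybar for every i.
    v_ml, v_robust = poisson_variances([1.0] * len(ys), ys, [ybar] * len(ys))
    print(math.sqrt(v_ml), math.sqrt(v_robust))
```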


20.2.3. Interpretation of Regression Coefficients 


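For linear models, with $\mathrm{E}[y|\mathbf{x}] = \mathbf{x}'\boldsymbol{\beta}$, the
coefficients $\boldsymbol{\beta}$ are readily interpreted as the effect of a one-unit
change in regressors on the conditional mean. For nonlinear models this interpretation
needs to be modified; see the general discussion given in Section 5.2.4. For any model
with exponential conditional mean, differentiation yields

$$\frac{\partial \mathrm{E}[y|\mathbf{x}]}{\partial x_j} = \beta_j \exp(\mathbf{x}'\boldsymbol{\beta}), \tag{20.8}$$

where the scalar $x_j$ denotes the $j$th regressor. For example, if $\hat{\beta}_j = 0.25$
and $\exp(\mathbf{x}_i'\hat{\boldsymbol{\beta}}) = 3$, then a one-unit change in the $j$th
regressor increases the expectation of $y$ by 0.75 units. This partial response depends on
$\exp(\mathbf{x}_i'\hat{\boldsymbol{\beta}})$, which is expected to vary across
individuals. Dividing (20.8) by $\mathrm{E}[y|\mathbf{x}] = \exp(\mathbf{x}'\boldsymbol{\beta})$
shows that $\beta_j$ measures the relative change in $\mathrm{E}[y|\mathbf{x}]$ induced by
a unit change in $x_j$. If $x_j$ is measured on a logarithmic scale, $\beta_j$ is an
elasticity.

For purposes of reporting a single response value, a good candidate is an estimate of
the average response,
$N^{-1} \sum_{i} \partial \mathrm{E}[y_i|\mathbf{x}_i]/\partial x_{ij} = \hat{\beta}_j \times N^{-1} \sum_{i} \exp(\mathbf{x}_i'\hat{\boldsymbol{\beta}})$.
For Poisson regression models with intercept included, this simplifies to
$\hat{\beta}_j \bar{y}$, because the first-order conditions (20.5) then imply that the
fitted means $\exp(\mathbf{x}_i'\hat{\boldsymbol{\beta}})$ average to $\bar{y}$.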
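
The response (20.8) and its sample average are simple to compute once the fitted index
values $\mathbf{x}_i'\hat{\boldsymbol{\beta}}$ are available. The short sketch below
reproduces the numerical example in the text; the function names and the three index
values used for the average are hypothetical.

```python
import math


def marginal_effect(beta_j, xb):
    """Equation (20.8): response to a unit change in x_j at index value xb = x'beta."""
    return beta_j * math.exp(xb)


def average_marginal_effect(beta_j, xbs):
    """Sample average of (20.8) over the observed index values."""
    return beta_j * sum(math.exp(xb) for xb in xbs) / len(xbs)


if __name__ == "__main__":
    # beta_j = 0.25 and exp(x'beta) = 3 give a response of 0.75.
    print(marginal_effect(0.25, math.log(3.0)))
    # Fitted means 1, 3, and 5 average to 3, so the average response is also 0.75.
    print(average_marginal_effect(0.25, [math.log(1.0), math.log(3.0), math.log(5.0)]))
```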

Another consequence of (20.8) is that if, say, $\beta_j$ is twice as large as $\beta_k$,
then the effect of changing the $j$th regressor by one unit is twice that of changing the
$k$th regressor by one unit.


20.2.4. Overdispersion


The Poisson regression model is usually too restrictive for count data, leading to
alternative models presented in Sections 20.4 and 20.5. The fundamental problem is that
the distribution is parameterized in terms of a single scalar parameter ($\mu$) so that
all moments of $y$ are a function of $\mu$. By contrast the normal distribution has
separate parameters for location ($\mu$) and scale ($\sigma^2$). For the same reason the
one-parameter exponential is too restrictive for duration data and more general
two-parameter distributions such as the Weibull are superior. Note that this
complication does not arise with binary data. Then the distribution is clearly the
one-parameter Bernoulli, because if the probability of success is $p$ then the
probability of failure must be $1 - p$. For binary data the issue is instead how to
parameterize $p$ in terms of regressors.

One way this restrictiveness manifests itself is that in many applications a Poisson 
density predicts the probability of a zero count to be considerably less than is actually 
observed in the sample. This is termed the excess zeros problem, as there are more 
zeros in the data than the Poisson predicts. 

A second and more obvious deficiency of the Poisson model is that for count data 
the variance usually exceeds the mean, a feature called overdispersion. The Poisson 
model instead implies equality of the variance and the mean (see (20.2)), a property 
called equidispersion. 

Overdispersion has qualitatively similar consequences to the failure of the assumption
of homoskedasticity in the linear regression model. Provided the conditional mean
is correctly specified, that is, (20.3) holds, the Poisson MLE is still consistent. This is
clear from inspection of the first-order conditions (20.5), since the left-hand side of
(20.5) will have expected value of zero if
$\mathrm{E}[y_i|\mathbf{x}_i] = \exp(\mathbf{x}_i'\boldsymbol{\beta})$. This consistency
property applies more generally to the quasi-MLE when the specified density is in the
linear exponential family (LEF). Both Poisson and normal are members of the LEF
discussed earlier in Section 5.7.3. It is nonetheless important to control for
overdispersion. First, in more complicated settings such as with truncation and
censoring, overdispersion leads to the more fundamental problem of inconsistency.
Second, even in the simplest settings large overdispersion leads to grossly deflated
standard errors and grossly inflated t-statistics in the usual ML output, and hence it is
important to use the previously given robust variance estimator. Third, if one wants to
estimate probabilities of number of events, rather than merely the conditional mean,
these depend on additional parameters of the dgp.

Overdispersion may signal the presence of a more basic misspecification, especially
in data settings that involve truncation and censoring that are ignored in estimation.
In such a case the conditional mean is incorrectly specified and the simultaneous
presence of overdispersion then leads to inconsistency, not only inefficiency, of the
MLE.

A statistical test of overdispersion is therefore highly desirable after running a
Poisson regression. Most count models with overdispersion specify overdispersion to
be of the form

$$\mathrm{V}[y_i|\mathbf{x}_i] = \mu_i + \alpha g(\mu_i), \tag{20.9}$$

where $\alpha$ is an unknown parameter and $g(\cdot)$ is a known function, most commonly
$g(\mu) = \mu^2$ or $g(\mu) = \mu$. It is assumed that under both null and alternative
hypotheses the mean is correctly specified as, for example,
$\exp(\mathbf{x}_i'\boldsymbol{\beta})$, whereas under the null hypothesis $\alpha = 0$ so
that $\mathrm{V}[y_i|\mathbf{x}_i] = \mu_i$. A simple overdispersion test statistic
for $H_0: \alpha = 0$ versus $H_1: \alpha \neq 0$ or $H_1: \alpha > 0$ can be computed by
estimating the Poisson model, constructing fitted values
$\hat{\mu}_i = \exp(\mathbf{x}_i'\hat{\boldsymbol{\beta}})$, and running the auxiliary
OLS regression (without constant)

$$\frac{(y_i - \hat{\mu}_i)^2 - y_i}{\hat{\mu}_i} = \alpha \frac{g(\hat{\mu}_i)}{\hat{\mu}_i} + u_i, \tag{20.10}$$

where $u_i$ is an error term. The reported t-statistic for $\alpha$ is asymptotically
normal under the null hypothesis of no overdispersion (Cameron and Trivedi, 1990) even
though here generated regressors are used. This test can also be used for
underdispersion, $\alpha < 0$, in which case the conditional variance is less than the
conditional mean. See also Gurmu and Trivedi (1992).
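
The auxiliary regression (20.10) involves a single regressor and no constant, so the
estimate of $\alpha$ and its t-statistic have closed forms. The sketch below is
illustrative only. To keep it self-contained we assume an intercept-only Poisson
regression, for which the fitted mean is $\bar{y}$ for every observation, and we simulate
overdispersed data from a Poisson-gamma mixture with $g(\mu) = \mu^2$ and $\alpha = 1$;
the function names are hypothetical.

```python
import math
import random


def poisson_draw(mu, rng):
    """Draw a Poisson(mu) variate by Knuth's multiplication method (small mu only)."""
    limit, k, prod = math.exp(-mu), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_overdispersed(n, mu, alpha, rng):
    """Poisson-gamma mixture with mean mu and variance mu + alpha * mu**2."""
    return [poisson_draw(mu * rng.gammavariate(1.0 / alpha, alpha), rng)
            for _ in range(n)]


def overdispersion_test(ys, mus, g=lambda m: m * m):
    """OLS without constant for (20.10); returns (alpha_hat, t_statistic)."""
    z = [((y - m) ** 2 - y) / m for y, m in zip(ys, mus)]   # dependent variable
    w = [g(m) / m for m in mus]                             # single regressor
    sww = sum(wi * wi for wi in w)
    alpha = sum(zi * wi for zi, wi in zip(z, w)) / sww
    rss = sum((zi - alpha * wi) ** 2 for zi, wi in zip(z, w))
    s2 = rss / (len(ys) - 1)                                # one estimated coefficient
    return alpha, alpha / math.sqrt(s2 / sww)


if __name__ == "__main__":
    rng = random.Random(2)
    ys = simulate_overdispersed(2000, 3.0, 1.0, rng)
    ybar = sum(ys) / len(ys)
    # Intercept-only Poisson MLE: fitted mean equals ybar for all observations.
    print(overdispersion_test(ys, [ybar] * len(ys)))
```

Passing `g=lambda m: m` instead gives the test against the alternative $g(\mu) = \mu$.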


20.3. Count Example: Contacts with Medical Doctor 


For illustration we use some of the data from the RAND Health Insurance Experi- 
ment previously used by Deb and Trivedi (2002). They estimated a more complete set 
of models and carried out a deeper analysis of the data than is possible or desirable 
here. The experiment, conducted by the RAND Corporation from 1974 to 1982, has 
been the longest running and largest controlled social experiment in medical care re- 
search. The main goal of the experiment was to assess how the patient’s use of health 
services is affected by types of randomly assigned health insurance, including both 
fee-for-service and health maintenance organizations (HMOs). In the experiment the 
data were collected from about 8,000 enrollees in 2,823 families, from six sites across 
the country. Each family was enrolled in one of 14 different health insurance plans for 
either three or five years. The plans ranged from free care to 95% coinsurance below a 
maximum dollar expenditure (MDE), and also included assignment in a prepaid group 
practice. 

The key point is that because insurance plans are randomly assigned, not freely
chosen by the participants, we do not face the problem of an endogenous treatment when
estimating the treatment effect, which is the central causal parameter of interest in
the study.

Data were collected on the enrollees' use of medical care services and health status
throughout the randomly assigned term of enrollment of either three or five years.
For additional details of the data see Manning et al. (1987), Newhouse et al. (1993),
and Deb and Trivedi (2002). The sample used in this study consists of individuals in
the fee-for-service plans only.

The data file consists of utilization, expenditures, demographic characteristics,
health status, and insurance status variables. The expenditure data were analyzed in
Section 16.6. The coinsurance rate in this sample assumes four different values. Yet,
following the RAND studies, we treat it as a continuous variable. The final sample
consists of 20,186 observations; each observation represents data for an experimental
subject in a given year. For simplicity of exposition the resulting clustering in the
data (see Section 24.5) is ignored here.

Table 20.3. Contacts with Medical Doctor: Frequency Distribution

Contacts              0     1     2     3     4     5     6     7     8     9     10
Relative Frequency    31.2  18.9  13.8  9.3   6.7   4.8   3.4   2.6   2.0   1.4   1.0

Contacts              11    12    13    14    15    16    >21   Max
Relative Frequency    0.9   0.6   0.5   0.4   0.3   0.3   1.0   77

In the present illustration the measure of utilization analyzed is the number of
contacts with a medical doctor (MDU). The relative frequency distribution of MDU, in
percentages, is given in Table 20.3. MDE denotes maximum dollar expenditure, the
medical expenditure liability limit set in the experiment above which the participant
would not be responsible for cost-sharing. Observe that about 31% of the observations
are zeros. The long right tail and a variance greatly exceeding the mean indicate that
the counts are (unconditionally) overdispersed.

For the purposes of discussion here we consider the regression to be estimated by
Poisson ML and by Poisson PML. Other specifications are considered later. The included
covariates in all cases are those in Table 20.4.


Table 20.4. Contacts with Medical Doctor: Variable Descriptions

Variable    Definition                                             Mean      Std. Dev.
MDU         Number of outpatient visits to an MD                   2.861     4.505
LC          ln(coinsurance + 1), 0 ≤ coinsurance ≤ 100             1.710     1.962
IDP         1 if individual deductible plan, 0 otherwise           0.220     0.414
LPI         ln(max(1, annual participation incentive payment))     4.709     2.697
FMDE        0 if IDP = 1;                                          3.153     3.641
            ln(max(1, MDE/(0.01 coinsurance))) otherwise
LINC        ln(family income)                                      8.708     1.228
LFAM        ln(family size)                                        1.248     0.539
AGE         Age in years                                           25.718    16.768
FEMALE      1 if person is female                                  0.517     0.500
CHILD       1 if age is less than 18                               0.402     0.490
FEMCHILD    FEMALE * CHILD                                         0.194     0.395
BLACK       1 if race of household head is black                   0.182     0.383
EDUCDEC     Education of the household head in years               11.967    2.806
PHYSLIM     1 if the person has a physical limitation              0.124     0.322
NDISEASE    Number of chronic diseases                             11.244    6.742
HLTHG       1 if self-rated health is good                         0.362     0.481
HLTHF       1 if self-rated health is fair                         0.077     0.267
HLTHP       1 if self-rated health is poor                         0.015     0.121

Omitted category is excellent self-rated health.


Table 20.5. Contacts with Medical Doctor: Count Model Estimates

                  Poisson            PPML         NB2-PML
Variable     Coeff.    t-ratio     t-ratio     Coeff.    t-ratio
LC          -0.0427     -7.030      -2.835    -0.0504     -3.228
IDP         -0.1613    -13.881      -5.773    -0.1475     -4.889
LPI          0.0128      6.999       2.912     0.0158      3.574
FMDE        -0.0206     -5.803      -2.319    -0.0213     -2.351
PHYSLIM      0.2684     21.711       8.240     0.2751      8.068
NDISEASE     0.0231     38.124      13.487     0.0259     15.324
HLTHG        0.0394      4.109       1.699     0.0065      0.275
HLTHF        0.2531     15.613       5.894     0.2368      5.425
HLTHP        0.5216     19.150       6.966     0.4256      6.205
α                 -          -           -     1.1822      8.926
-ln L         60087                             42777

A selection of interesting coefficients and their t-ratios is given in Table 20.5,
along with log-likelihood and information criteria. To save space we do not reproduce
all the output. The coefficients of the insurance variables (LC, IDP, LPI, and FMDE)
are clearly of interest since they reflect the price sensitivity of utilization. Also of
interest are the coefficients of the five health status variables
(PHYSLIM, NDISEASE, HLTHG, HLTHF, and HLTHP).

Consider the coefficient of the coinsurance rate, here measured on the log scale,
LC. This variable is of major interest as it provides information about the price effect.
The higher the coinsurance rate, the greater will be the extent of cost sharing by the
patient, and hence the lower will be the average number of visits. The estimated
coefficient from the Poisson regression (see column 1 in Table 20.5) is negative (−0.043),
with a robust t-ratio of −2.835, indicating that the price effect is significantly
negative as predicted by standard theory. Because LC = ln(coinsurance + 1), the
coefficient can be interpreted as an elasticity of the number of doctor visits with
respect to the coinsurance rate of about −0.043. However, care should be exercised in
interpreting this value, as the coinsurance rate takes only a few values and does not
vary continuously. The corresponding value for log of income (LINC) is 0.174, indicating
that an increase in income raises the average number of visits.

How well does the Poisson regression fit the data? One simple way to judge this
is to compare the actual and fitted frequencies for different numbers of doctor visits.
Table 20.6 provides such a comparison for up to nine visits, ignoring the higher
frequencies that collectively account for less than 10% of the visits. To calculate the
fitted value $\Pr[y_i|\mathbf{x}_i, \hat{\beta}]$ for $y_i = 0, 1, \ldots, 9$, we plug
$\hat{\mu}_i$ into (20.1) and then average over all the observations. Observe that the
Poisson regression seriously underpredicts the proportion of zero visits and overpredicts
the proportions of one to seven visits. Thus we conclude that the Poisson regression is
deficient. This pattern in the lack of fit can be shown to be associated with the neglect
of overdispersion in the data (Cameron and Trivedi, 1998, chapter 4).
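
The averaging step just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code behind Table 20.6: the function names and the five fitted means in `mu_hat` are hypothetical, and the Poisson density is taken to be the usual $e^{-\mu}\mu^{y}/y!$.

```python
import math

def poisson_pmf(y, mu):
    """Poisson probability Pr[Y = y] for mean mu, computed on the log scale."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def average_fitted_frequencies(mu_hat, max_count=9):
    """Average Pr[y_i = k | mu_hat_i] over observations for k = 0, ..., max_count."""
    n = len(mu_hat)
    return [sum(poisson_pmf(k, m) for m in mu_hat) / n
            for k in range(max_count + 1)]

# Hypothetical fitted means exp(x_i' beta_hat) for five observations.
mu_hat = [0.8, 1.5, 2.9, 4.2, 6.0]
fitted = average_fitted_frequencies(mu_hat)
for k, p in enumerate(fitted):
    print(f"{k} visits: {100 * p:.1f}%")
```

Comparing these averaged probabilities with the observed relative frequencies gives the kind of fit check reported in the table.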


Table 20.6. Contacts with Medical Doctor: Observed and Fitted Frequencies


Contact frequency          0      1      2      3      4      5      6      7      8      9

Relative frequency (%)  31.2   18.9   13.8    9.3    6.7    4.8    3.4    2.6    2.0    1.4
Poisson fitted (%)      10.6   19.2   20.9   17.6   12.6   7.99   4.69   2.64   1.46    0.8
NB2 fitted (%)          30.9   19.6   13.6   9.67   6.97   5.07   3.70   2.72    2.0   1.47


In the presence of neglected overdispersion it is to be expected that the t-ratios of
the Poisson MLE will be inflated. A comparison with the robust t-ratios in column 3
(PPML) of Table 20.5 shows that this is indeed so. For example, robustification causes
the t-ratio of LC to change from -7.030 to -2.835, a large fall in absolute value.
Tables 20.5 and 20.6 include results for the NB2 model that are discussed in Section 20.7.
The NB2 model is a better parametric model for these data.


20.4. Parametric Count Regression Models 


Poisson regression is often too restrictive. In this section we present a number of more 
flexible parametric alternatives to the Poisson. 

First, overdispersion in count data may be due to unobserved heterogeneity. In such 
a case counts are viewed as being generated by a Poisson process (in which case the 
events are serially independent), but the researcher is unable to correctly specify the 
rate parameter of this process. Instead, the rate parameter is itself a random variable. 
This mixture approach, presented in Sections 20.4.1 and 20.4.2, leads to the widely 
used negative binomial model. 

Second, overdispersion, and in some cases underdispersion, may arise because the 
process generating the first event may differ from that determining later events. For ex- 
ample, an initial doctor consultation may be solely a patient’s choice, whereas subse- 
quent visits are also determined by the doctor. This leads to the modified count models 
presented in Section 20.4.5. 

Third, overdispersion in count data may be due to failure of the assumption of in- 
dependence of events, which is implicit in the Poisson process. One can postulate 
dependence so that, for example, the occurrence of one doctor visit makes subse- 
quent doctor visits more likely. (This approach has not been widely used in count 
data analysis. In duration data analysis this is called true state dependence.) Particular 
assumptions about unobserved heterogeneity or dependence again lead to the negative 
binomial; see Winkelmann (1995). A discrete choice model that progressively models
$\Pr[y = j|y > j - 1]$ is presented in Section 20.4.6.

Fourth, one can refer to the extensive and rich literature on univariate iid count 
distributions, such as the logarithmic series and hypergeometric distribution (Johnson, 
Kotz, and Kemp, 1992). New regression models can be developed by letting one or 
more distribution parameters be a specified function of regressors. Such models are 
not presented here. The approach has less motivation than the first three approaches 
and the resulting models may not be any better.

Although overdispersion has been emphasized, underdispersion may also arise. For
example, a sample in which the counted outcome is largely 0 or 1, with a very small 
number of 2s, and hence close to a binomial model, will show underdispersion. Mem- 
bers of the Katz family of distributions, or other distributions based on the series ex- 
pansion methods such as those developed in Cameron and Johansson (1997), can be 
used; see also Cameron and Trivedi (1998, chapter 12). 


20.4.1. Negative Binomial Model 


The negative binomial model, a specific example of a continuous mixture model, can 
be obtained in many different ways. The following justification using a mixture distri- 
bution is one of the oldest and has wide appeal. 

Suppose the distribution of a random count $y$ is Poisson, conditional on the
parameter $\lambda$, so that $f(y|\lambda) = \exp(-\lambda)\lambda^{y}/y!$. Suppose now
that the parameter $\lambda$ is random, rather than being a completely deterministic
function of regressors $\mathbf{x}$. In particular, let $\lambda = \mu\nu$, where $\mu$ is
a deterministic function of $\mathbf{x}$, for example $\exp(\mathbf{x}'\beta)$, and
$\nu > 0$ is iid with density $g(\nu|\alpha)$. This is an example of unobserved
heterogeneity, as different observations may have different $\lambda$ (heterogeneity) but
part of this difference is due to a random (unobserved) component $\nu$. Note that
$E[\lambda|\mu] = \mu$ if $E[\nu] = 1$, so the interpretation of the slope parameters
stays as in the Poisson model.

The marginal density of $y$, unconditional on the random parameter $\nu$ but conditional
on the deterministic parameters $\mu$ and $\alpha$, is obtained by integrating out $\nu$.
This yields

$$h(y|\mu, \alpha) = \int f(y|\mu, \nu)\, g(\nu|\alpha)\, d\nu, \tag{20.11}$$

where $g(\nu|\alpha)$ is called the mixing distribution and $\alpha$ denotes the unknown
parameter of the mixing distribution. The integration defines an “average” distribution.
For some specific choices of $f(\cdot)$ and $g(\cdot)$, the integral will have an explicit
or closed-form solution.

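If $f(y|\lambda)$ is the Poisson density and
$g(\nu) = \nu^{\delta-1}e^{-\nu\delta}\delta^{\delta}/\Gamma(\delta)$, $\nu, \delta > 0$,
is the gamma density with $E[\nu] = 1$ and $V[\nu] = 1/\delta$, we obtain the negative
binomial as a mixture density as follows:

$$\begin{aligned}
h(y|\mu, \delta)
 &= \int_0^{\infty} \frac{e^{-\mu\nu}(\mu\nu)^{y}}{y!}\,
    \frac{\nu^{\delta-1}e^{-\nu\delta}\delta^{\delta}}{\Gamma(\delta)}\, d\nu \\
 &= \frac{\mu^{y}\delta^{\delta}}{\Gamma(\delta)\, y!}
    \int_0^{\infty} e^{-(\mu+\delta)\nu}\,\nu^{y+\delta-1}\, d\nu \\
 &= \frac{\mu^{y}\delta^{\delta}\,\Gamma(y+\delta)}
         {\Gamma(\delta)\, y!\, (\mu+\delta)^{y+\delta}} \\
 &= \frac{\Gamma(\alpha^{-1}+y)}{\Gamma(\alpha^{-1})\,\Gamma(y+1)}
    \left(\frac{\alpha^{-1}}{\alpha^{-1}+\mu}\right)^{\alpha^{-1}}
    \left(\frac{\mu}{\mu+\alpha^{-1}}\right)^{y},
\end{aligned} \tag{20.12}$$

where $\alpha = 1/\delta$, $\Gamma(\cdot)$ denotes the gamma integral, which specializes
to a factorial for an integer argument, and the fourth line follows after some algebra and
use of the definition of the gamma function. Special cases of the negative binomial
include the Poisson ($\alpha = 0$), which is the advantage of the reparametrization from
$\delta$ to $\alpha$, and the geometric ($\alpha = 1$).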
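
As a numerical check on the Poisson-gamma mixture result, the sketch below integrates the Poisson density against a gamma mixing density with mean 1 and variance $1/\delta$ and compares the result with the closed-form negative binomial density. The function names, the midpoint integration rule, and the values $\mu = 2$ and $\alpha = 0.5$ are illustrative assumptions, chosen only to keep the example short.

```python
import math

def poisson_pmf(y, lam):
    """Poisson density f(y | lam), evaluated on the log scale for stability."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def gamma_mixing_density(v, delta):
    """Gamma density g(v) with E[v] = 1 and V[v] = 1/delta."""
    return math.exp((delta - 1) * math.log(v) - delta * v
                    + delta * math.log(delta) - math.lgamma(delta))

def nb_closed_form(y, mu, alpha):
    """Closed-form negative binomial density with alpha = 1/delta."""
    a_inv = 1.0 / alpha
    log_h = (math.lgamma(a_inv + y) - math.lgamma(a_inv) - math.lgamma(y + 1)
             + a_inv * math.log(a_inv / (a_inv + mu))
             + y * math.log(mu / (mu + a_inv)))
    return math.exp(log_h)

def nb_by_integration(y, mu, alpha, upper=40.0, steps=40000):
    """Midpoint-rule approximation to the mixture integral over v in (0, upper)."""
    delta = 1.0 / alpha
    width = upper / steps
    total = 0.0
    for k in range(steps):
        v = (k + 0.5) * width  # midpoint of the k-th subinterval
        total += poisson_pmf(y, mu * v) * gamma_mixing_density(v, delta)
    return total * width

mu, alpha = 2.0, 0.5
for y in range(4):
    print(y, round(nb_by_integration(y, mu, alpha), 6),
          round(nb_closed_form(y, mu, alpha), 6))
```

The two columns should agree to several decimal places, which is what the closed-form solution of the integral implies.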

As in the case of many mixture distributions, the negative binomial also has inde- 
pendent justification; see Cameron and Trivedi (1998, chapter 4). It can arise in many 
different ways and one need not always think of it as a mixture distribution. 

The algebraic derivation of the negative binomial as a Poisson-gamma mixture
can be given a Bayesian interpretation. The prior distribution of $\mu$ is gamma, given
$\alpha$, and, from the results on conjugate priors for exponential families in
Section 13.2.4, it is expected that the posterior distribution has a closed form.
Therefore, the MLE and the Bayesian posterior mean, under the further assumption of a
vague prior on $\alpha$, would coincide.

The first two moments of the negative binomial distribution are 


$$\begin{aligned}
E[y|\mu, \alpha] &= \mu, \\
V[y|\mu, \alpha] &= \mu(1 + \alpha\mu).
\end{aligned} \tag{20.13}$$


The variance therefore exceeds the mean, since $\alpha > 0$ and $\mu > 0$. Indeed, it can
be shown easily that overdispersion always arises if $y|\lambda$ is Poisson and unobserved
heterogeneity is of the multiplicative form $\lambda = \mu\nu$, where $E[\nu] = 1$. Note
also that the overdispersion is of the form (20.9) discussed in Section 20.2.4.
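
The overdispersion claim can be verified with the laws of iterated expectations and total variance. The short derivation below uses only the Poisson assumption for $y$ given $\lambda = \mu\nu$ and the normalization $E[\nu] = 1$; the subscript $\nu$ on the expectation and variance operators is notation introduced here.

```latex
\begin{aligned}
E[y|\mu] &= E_{\nu}\big[E[y|\mu,\nu]\big] = E_{\nu}[\mu\nu] = \mu, \\
V[y|\mu] &= E_{\nu}\big[V[y|\mu,\nu]\big] + V_{\nu}\big[E[y|\mu,\nu]\big]
          = E_{\nu}[\mu\nu] + V_{\nu}[\mu\nu]
          = \mu + \mu^{2}\,V[\nu].
\end{aligned}
```

Since $V[\nu] > 0$ whenever $\nu$ is genuinely random, the variance exceeds the mean. With gamma mixing, $V[\nu] = 1/\delta = \alpha$, which gives the variance $\mu(1 + \alpha\mu)$.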

Two standard variants of the negative binomial are used in regression applications.
Both variants specify $\mu_i = \exp(\mathbf{x}_i'\beta)$. The most common variant lets
$\alpha$ be a parameter to be estimated, in which case the conditional variance function,
$\mu + \alpha\mu^{2}$ from (20.13), is quadratic in the mean.

The other variant of the negative binomial model has a linear variance function,
$V[y|\mu, \gamma] = (1 + \gamma)\mu$, obtained by replacing $\alpha$ by $\gamma/\mu$
throughout (20.12). Sometimes this variant is called negative binomial 1 (NB1), in
contrast to the variant with a quadratic variance function, which has been called the
negative binomial 2 (NB2) model (Cameron and Trivedi, 1998). The log-likelihood for
either variant is easily obtained from (20.12), and estimation by ML is straightforward,
with details given in, for example, Cameron and Trivedi (1998). In both variants the
coefficients have the same interpretation since $E[y|\mathbf{x}] = \exp(\mathbf{x}'\beta)$.
The NB2 variant is the most often used, as in the application in Section 20.7.
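
To make the NB2 log-likelihood concrete, the following sketch evaluates it for given parameter values, using the closed-form negative binomial density with $\mu_i = \exp(\mathbf{x}_i'\beta)$. The function names and the small data set are hypothetical, and no optimizer is included; in practice the function would be passed to a numerical maximizer.

```python
import math

def nb2_loglik(beta, alpha, y, X):
    """NB2 log-likelihood with conditional mean mu_i = exp(x_i' beta)."""
    a_inv = 1.0 / alpha
    total = 0.0
    for y_i, x_i in zip(y, X):
        mu_i = math.exp(sum(b * x for b, x in zip(beta, x_i)))
        total += (math.lgamma(y_i + a_inv) - math.lgamma(a_inv)
                  - math.lgamma(y_i + 1)
                  + a_inv * math.log(a_inv / (a_inv + mu_i))
                  + y_i * math.log(mu_i / (mu_i + a_inv)))
    return total

def poisson_loglik(beta, y, X):
    """Poisson log-likelihood with the same exponential conditional mean."""
    total = 0.0
    for y_i, x_i in zip(y, X):
        xb = sum(b * x for b, x in zip(beta, x_i))
        total += -math.exp(xb) + y_i * xb - math.lgamma(y_i + 1)
    return total

# Hypothetical data: six observations, an intercept and one regressor.
y = [0, 1, 3, 0, 7, 2]
X = [[1.0, 0.2], [1.0, -0.5], [1.0, 1.1], [1.0, -1.3], [1.0, 1.8], [1.0, 0.4]]
beta = [0.5, 0.6]

print("NB2, alpha = 0.8 :", nb2_loglik(beta, 0.8, y, X))
print("NB2, alpha near 0:", nb2_loglik(beta, 1e-6, y, X))
print("Poisson          :", poisson_loglik(beta, y, X))
```

The last two values should nearly coincide, reflecting the fact that the Poisson is the limiting case $\alpha = 0$ of the negative binomial.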

The NB2 model has been found to be very useful in applied work. It appears to 
have the flexibility necessary for providing a good fit to many types of count data. It 
does so in part because the quadratic variance specification is a good approximation 
in many empirical situations. An unfortunate consequence of the fact that NB2 often
provides a good fit is that if the Poisson assumption fails there is a tendency to jump
to the negative binomial alternative, ignoring other possibilities. Such a mechanical
approach should be avoided. Poor performance of the Poisson can also be due to a poor
specification of the conditional mean function, and the negative binomial model
maintains the same conditional mean.

The negative binomial model is less robust to distributional misspecification than
the Poisson. If the distribution is misspecified, then even if the conditional mean is
correctly specified the MLE in negative binomial models is inconsistent. The exception
is the NB2 model, for which the MLE of $\beta$ (but not of $\alpha$) is still consistent.

For mixture models for counts, the Poisson is the natural choice for the initial
density $f(y|\mu, \nu)$ in (20.12) since a Poisson process is a natural model for counts.
The choice of the gamma for the mixing distribution $g(\nu)$ in (20.12) is more arbitrary.
Its use raises issues discussed in Sections 18.2-18.4. Other possible choices include the
lognormal distribution and the inverse-Gaussian distribution. See Willmot (1987) and 
Guo and Trivedi (2002). In these cases the marginal distribution cannot be expressed 
in a closed form, as it is the gamma that is conjugate to the Poisson. Of course, this 
does not mean that the resulting model cannot be estimated by maximum likelihood. It 
means simply that one may have to use numerical quadrature or simulated maximum 
likelihood to estimate the model. These methods are entirely feasible with currently 
available computing power. If one is prepared to use the simulation-based estimation 
methods discussed in chapter 12, the scope for using mixed-Poisson models of various 
types becomes very extensive. 


20.4.2. Simulated Maximum Likelihood 


Purely for illustration, we now show how the NB2 model might be estimated by maximum
simulated likelihood (MSL). The reader should understand that in practice this is
unnecessary because we already have an analytical expression for that model. Suppose we
pretend that we do not and tackle estimation by simulation.

Note that $h(y|\alpha, \mu)$ in (20.12) can be approximated by

$$\frac{1}{S}\sum_{s=1}^{S} \frac{e^{-\mu\nu_s}(\mu\nu_s)^{y}}{y!},$$

where $\nu_s$ ($s = 1, \ldots, S$) are pseudo-random draws from the distribution
$g(\nu|\alpha)$, and $S$ is the number of simulation replications used. Drawing from a
gamma distribution with mean 1 and variance $\alpha$ is straightforward. One draws from a
uniform distribution and then applies a transformation to it. Let $u_s$ denote the uniform
random variables and let $\nu_s = -\ln u_s/\alpha$, and then define the simulator

$$\tilde{f}(y|\nu_s, \alpha, \mu) =
  \frac{e^{-\mu(-\ln u_s/\alpha)}\left(\mu(-\ln u_s/\alpha)\right)^{y}}{y!}.$$


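Then the MSL estimator $\hat{\theta}_{MSL}$ maximizes

$$Q_N(\theta) = \sum_{i=1}^{N} \ln\left(\frac{1}{S}\sum_{s=1}^{S}
  \tilde{f}(y_i|\nu_{is}, \alpha, \mu_i)\right), \tag{20.14}$$

where $\mu_i = \exp(\mathbf{x}_i'\beta)$, $\theta = (\alpha, \beta)$, and $\nu_{is}$
denotes the $s$th draw for observation $i$.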

Of course, this method is computer intensive but otherwise straightforward. A fuller
discussion of the properties of MSL was given earlier in Section 12.4. Here we just
remind the reader that when $S, N \to \infty$ and $\sqrt{N}/S \to 0$, then
$\hat{\theta}_{MSL}$ and $\hat{\theta}_{ML}$ are asymptotically equivalent.
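
The simulator can be illustrated with a short Python sketch that averages Poisson probabilities over heterogeneity draws and compares the result with the exact NB2 density. Several choices here are assumptions made for the illustration rather than part of the text: the draws come directly from `random.gammavariate` with shape $1/\alpha$ and scale $\alpha$ (so mean 1 and variance $\alpha$) instead of a transformation of uniforms, one fixed set of draws is reused for every observation, and the data are hypothetical. For actual maximization over $\alpha$ one would hold the underlying random numbers fixed across evaluations.

```python
import math
import random

def poisson_pmf(y, lam):
    """Poisson density, with the degenerate case lam = 0 handled explicitly."""
    if lam <= 0.0:
        return 1.0 if y == 0 else 0.0
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

def nb2_pmf(y, mu, alpha):
    """Exact NB2 density, used only as a benchmark for the simulator."""
    a_inv = 1.0 / alpha
    return math.exp(math.lgamma(a_inv + y) - math.lgamma(a_inv)
                    - math.lgamma(y + 1)
                    + a_inv * math.log(a_inv / (a_inv + mu))
                    + y * math.log(mu / (mu + a_inv)))

def simulated_pmf(y, mu, draws):
    """Average of Poisson(y | mu * v_s) over the heterogeneity draws v_s."""
    return sum(poisson_pmf(y, mu * v) for v in draws) / len(draws)

def simulated_loglik(beta, y, X, draws):
    """Simulated log-likelihood: sum over i of the log of the simulated density."""
    total = 0.0
    for y_i, x_i in zip(y, X):
        mu_i = math.exp(sum(b * x for b, x in zip(beta, x_i)))
        total += math.log(simulated_pmf(y_i, mu_i, draws))
    return total

rng = random.Random(12345)
alpha, S = 0.5, 20000
draws = [rng.gammavariate(1.0 / alpha, alpha) for _ in range(S)]

# Hypothetical data: six observations, an intercept and one regressor.
y = [0, 1, 3, 0, 7, 2]
X = [[1.0, 0.2], [1.0, -0.5], [1.0, 1.1], [1.0, -1.3], [1.0, 1.8], [1.0, 0.4]]
beta = [0.5, 0.6]

exact = sum(math.log(nb2_pmf(y_i, math.exp(beta[0] + beta[1] * x_i[1]), alpha))
            for y_i, x_i in zip(y, X))
print("simulated log-likelihood:", simulated_loglik(beta, y, X, draws))
print("exact NB2 log-likelihood:", exact)
```

With a large number of draws the simulated and exact log-likelihoods should be close, which is the sense in which MSL approximates ML as $S$ grows.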


20.4.3. Finite Mixture Models 


The mixture model in the previous section was a continuous mixture model, because 
the mixing random variable v was assumed to have continuous distribution. An al- 
ternative approach instead uses a discrete representation of unobserved heterogeneity, 
which generates a class of models called finite mixture models; see Section 18.5. 
This class of models is a particular subclass of latent class models. Some variants and 
special cases of this model are also known as discrete factor models. 

In empirical work the more commonly used alternative to the continuous mixture is 
found in the class of modified count models discussed in the next section. However, it 
is more natural to follow up the preceding section with a discussion of finite mixtures. 
Further, the subclass of modified count models can be viewed as a special case of finite 
mixtures. 

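We suppose that the density of $y$ is a linear combination of $m$ different densities,
where the $j$th density is $f_j(y|\theta_j)$, $j = 1, 2, \ldots, m$. Thus an
$m$-component finite mixture is

$$f(y|\theta, \pi) = \sum_{j=1}^{m} \pi_j f_j(y|\theta_j), \qquad
  0 \le \pi_j \le 1, \qquad \sum_{j=1}^{m} \pi_j = 1. \tag{20.15}$$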


In the given formulation the components of the mixture are assumed, for generality, 
to differ in all their parameters. More restrictive formulations assume that only some 
parameters differ across the components (e.g., the intercepts) and the remaining param- 
eters are all common to the mixture components. Assumptions at some intermediate 
level of generality may also be made. 

For further insight consider this approach for the $m = 2$ case. Suppose that the
sampled population contains two “types” of cases, whose $y$-outcomes are characterized
by distributions $f_1(y|\theta_1)$ and $f_2(y|\theta_2)$, which we assume have different
moments. Suppose the type-1 subpopulation has mean $\mu(\theta_1)$ and the type-2
subpopulation has mean $\mu(\theta_2)$, where $\mu(\theta_2) < \mu(\theta_1)$. For example,
in a study of the use of medical services, the first subpopulation corresponds to
frequent users of the service and the second to relatively infrequent users. Assume that
the fractions of the two types in the population are $\pi_1$ and $\pi_2 (= 1 - \pi_1)$,
respectively. Then a random sample drawn from the population will contain proportions
$\pi_1$ and $\pi_2$ of the two types, although one cannot observe which case belongs to
which subpopulation. That is, the “types” are latent classes.

The goal of the researcher who uses this model is to estimate the unknown parameters
$\theta_j$, $j = 1, \ldots, m$. It is easy to develop regression models based on (20.15).
For example, if NB2 models are used then $f_j(y|\theta_j)$ is the NB2 density (20.12)
with parameters $\mu_j = \exp(\mathbf{x}'\beta_j)$ and $\alpha_j$, so
$\theta_j = (\beta_j, \alpha_j)$. If the number of components, $m$, is given, then under
some regularity conditions maximum likelihood estimation of the parameters
$(\pi_j, \theta_j)$, $j = 1, \ldots, m$, is possible.
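
A two-component mixture density and its log-likelihood can be sketched as follows. For brevity the components are Poisson rather than NB2, which is a simplification made here; the function names, mixing weights, component means, and sample are all hypothetical.

```python
import math

def poisson_pmf(y, mu):
    """Poisson density on the log scale."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def finite_mixture_pmf(y, weights, means):
    """m-component mixture: sum_j pi_j f_j(y | mu_j) with Poisson components."""
    return sum(w * poisson_pmf(y, m) for w, m in zip(weights, means))

def finite_mixture_loglik(sample, weights, means):
    """Log-likelihood of an iid sample under the finite mixture."""
    return sum(math.log(finite_mixture_pmf(y, weights, means)) for y in sample)

# Hypothetical latent classes: infrequent users (70%) and frequent users (30%).
weights = [0.7, 0.3]
means = [1.0, 6.0]

total_prob = sum(finite_mixture_pmf(y, weights, means) for y in range(100))
mixture_mean = sum(y * finite_mixture_pmf(y, weights, means) for y in range(100))
print("probabilities sum to", round(total_prob, 6))
print("mixture mean        ", round(mixture_mean, 6))
print("log-likelihood      ",
      finite_mixture_loglik([0, 0, 1, 2, 5, 8], weights, means))
```

The mixture mean is the weighted sum of the component means, here 0.7 x 1 + 0.3 x 6 = 2.5, which matches the remark that ignoring latent classes yields weighted sums of the class parameters.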

The pros and cons of the finite mixture representation have also been given earlier
and will only be briefly mentioned here. Further discussion in the context of dura- 
tion models is in Section 18.5. First, a finite mixture is a flexible and parsimonious 
method of modeling the data. Each mixture component provides a local approxima- 
tion to some part of the true distribution. Second, the finite mixture approach is in a 
sense semiparametric because it does not require any distributional assumptions for 
the mixing variable. Finally, in many cases the results are easy to interpret. The finite 
mixture representation is attractive if the investigator is especially interested in the 
behavior of a subpopulation from the viewpoint of public policy. If latent classes are 
ignored, so m = 1, then the estimated parameters will be weighted sums of the latent 
class parameters. 

There are several potential difficulties also. First, we may have very little theoretical 
guidance on specifying the number of components, and we may not be able to reliably 
distinguish among some of the components if they are not sufficiently different. The 
usual practice is to start with a few components and then add additional components 
if the fit of the model is significantly improved by doing so. In some cases only the 
intercepts may be allowed to differ and the slopes may be constrained to equality across 
components. Caution is necessary in this process because the sampling properties of 
the maximum likelihood estimator are not fully understood for the case in which m is 
unknown. 

There are several studies that indicate that finite mixture models work quite well for 
count data models of medical care (Deb and Trivedi, 1997; 2002). One possible reason 
for this is that the population might be split by the latent health status of individuals. 
Those who are healthy, perhaps the majority, might account for low average demand, 
whereas those who are ill may account for high average demand. When health status is
imperfectly observed, the finite mixture model may do a good job of separating
subpopulations.


20.4.4. Truncation and Censoring 


In some studies, inclusion in the sample requires that sampled individuals have been 
engaged in the activity of interest. Then the count data are truncated, as the data are 
observed only over part of the range of the response variable. Examples of truncated 
counts include the number of bus trips made per week in surveys taken on buses, 
the number of shopping trips made by individuals sampled at a mall, and the number 
of unemployment spells among a pool of unemployed. In all these cases we do not 
observe zero counts, so the data are said to be zero-truncated, or more generally 
left-truncated. Right-truncation results from loss of observations greater than some 
specified value. 

A general treatment of truncated and censored models, using ML estimation, is 
given in Section 16.2. Here we specialize to count data. 

Truncation leads to inconsistent parameter estimates unless the likelihood function
is suitably modified. Consider the case of zero truncation. Let $f(y|\theta)$ denote the
density function and $F(y|\theta) = \Pr[Y \le y]$ denote the cumulative distribution
function of the discrete random variable, where $\theta$ is a parameter vector. If
realizations of $y$ less than 1 are omitted, the ensuing zero-truncated density is given by

$$f(y|\theta, y \ge 1) = \frac{f(y|\theta)}{1 - F(0|\theta)}, \qquad
  y = 1, 2, \ldots \tag{20.16}$$

This specializes in the zero-truncated Poisson case, for example, to
$f(y|\mu, y \ge 1) = e^{-\mu}\mu^{y}/[y!(1 - \exp(-\mu))]$. It is straightforward to
construct a log-likelihood based on this density and to obtain maximum likelihood
estimates.
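
The zero-truncated Poisson density and its log-likelihood are easy to code. The sketch below is illustrative only: the function names, the value $\mu = 1.5$, and the sample are hypothetical.

```python
import math

def poisson_pmf(y, mu):
    """Untruncated Poisson density on the log scale."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def zero_truncated_poisson_pmf(y, mu):
    """Poisson density rescaled by 1 - Pr[y = 0], defined for y = 1, 2, ..."""
    if y < 1:
        return 0.0
    return poisson_pmf(y, mu) / (1.0 - math.exp(-mu))

def zero_truncated_poisson_loglik(sample, mu):
    """Log-likelihood for a sample of strictly positive counts."""
    return sum(math.log(zero_truncated_poisson_pmf(y, mu)) for y in sample)

mu = 1.5
total_prob = sum(zero_truncated_poisson_pmf(y, mu) for y in range(1, 60))
truncated_mean = sum(y * zero_truncated_poisson_pmf(y, mu) for y in range(1, 60))
print("probabilities sum to", round(total_prob, 6))
print("truncated mean      ", round(truncated_mean, 6))
print("log-likelihood      ", zero_truncated_poisson_loglik([1, 1, 2, 4], mu))
```

The truncated mean, $\mu/(1 - e^{-\mu})$, exceeds $\mu$, which is one way to see why fitting the untruncated likelihood to zero-truncated data gives inconsistent estimates.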

Censored counts most commonly arise from aggregation of counts greater than 
some value. This is often done in survey design when the total probability mass over 
the aggregated values is relatively small. An important difference between truncation 
and censoring is that in the case of the latter, covariates corresponding to the cen- 
sored counts are observed; in the truncation case neither the counted outcomes nor 
the covariates are observed. Censoring, like truncation, leads to inconsistent parameter 
estimates if the uncensored likelihood is mistakenly used. See also Section 16.2. 

For example, the number of events greater than some known value $c$ might be
aggregated into a single category. Then some values of $y$ are incompletely observed; the
precise value is unknown but it is known to equal or exceed $c$. The observed data have
density

$$g(y|\theta) = \begin{cases}
  f(y|\theta) & \text{if } y < c, \\
  1 - F(c - 1|\theta) & \text{if } y \ge c,
\end{cases} \tag{20.17}$$

where $c$ is known.
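
For a Poisson count that is right-censored at $c$, the observed-data density can be sketched as below. The function names and the values $\mu = 2.5$ and $c = 5$ are hypothetical choices for the illustration.

```python
import math

def poisson_pmf(y, mu):
    """Poisson density on the log scale."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def poisson_cdf(y, mu):
    """Pr[Y <= y] obtained by summing the density."""
    return sum(poisson_pmf(k, mu) for k in range(y + 1))

def censored_poisson_pmf(y, mu, c):
    """f(y) below the censoring point, Pr[Y >= c] = 1 - F(c - 1) at or above it."""
    if y < c:
        return poisson_pmf(y, mu)
    return 1.0 - poisson_cdf(c - 1, mu)

mu, c = 2.5, 5
# The observable outcomes are 0, 1, ..., c - 1 and the single category "c or more".
total_prob = sum(censored_poisson_pmf(y, mu, c) for y in range(c + 1))
print("Pr[c or more]       ", round(censored_poisson_pmf(c, mu, c), 6))
print("probabilities sum to", round(total_prob, 6))
```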

A related complication is that of sample selection (Terza, 1998). Then the count $y$
is observed only when another random variable, potentially correlated with $y$, crosses
a threshold. For example, to see a medical specialist one may first need to see a general
practitioner.


20.4.5. Modified Count Models 


The leading motivation for the modified count models of this section is to solve the so- 
called problem of excess zeros, the presence of more zeros in the data than predicted 
by count models such as the Poisson, or even NB2. 


Hurdle or Two-Part Models 


The hurdle model or two-part model (see Section 16.4) relaxes the assumption that
the zeros and the positives come from the same data-generating process. The zeros are
determined by the density $f_1(\cdot)$, so that $\Pr[y = 0] = f_1(0)$. The positive counts
come from the truncated density $f_2(y|y > 0) = f_2(y)/(1 - f_2(0))$, which is multiplied
by $\Pr[y > 0] = 1 - f_1(0)$ to ensure that probabilities sum to unity. Thus

$$g(y) = \begin{cases}
  f_1(0) & \text{if } y = 0, \\
  \dfrac{1 - f_1(0)}{1 - f_2(0)}\, f_2(y) & \text{if } y \ge 1.
\end{cases} \tag{20.18}$$

This reduces to the standard model only if $f_1(\cdot) = f_2(\cdot)$. Thus in the modified
model the two processes generating the zeros and the positives are not constrained to be
the same. Although the motivation for this model is to handle excess zeros, it is also
capable of modeling too few zeros.

Maximum likelihood estimation of the hurdle model involves separate maximiza- 
tion of the two terms in the likelihood, one corresponding to the zeros and the other to 
the positives. This is straightforward. 

A hurdle model has the interpretation that it reflects a two-stage decision-making 
process. For example, a patient may initiate the first visit to a doctor, but the second 
and subsequent visits may be determined by a different mechanism (Pohlmeier and 
Ulrich, 1995). 

Regression applications use hurdle versions of the Poisson or negative binomial,
obtained by specifying $f_1(\cdot)$ and $f_2(\cdot)$ to be the Poisson or negative binomial
densities given earlier. In application the covariates in the hurdle part that models the
zero/one outcome need not be the same as those that appear in the truncated part, although
in practice they are often the same. The hurdle model is widely used, and the hurdle
negative binomial model is quite flexible. The drawbacks are that the model is not very
parsimonious, since the number of parameters is typically doubled, and that parameter
interpretation is not as easy as in the same model without a hurdle.

The choice of the distribution in the hurdle specification is important. Using a more 
flexible distribution gives the negative binomial obvious advantages over the Poisson. 
The conditional mean in the hurdle model is the product of the probability of positives 
and the conditional mean of the zero-truncated density. Therefore, using a Poisson re- 
gression when the hurdle model is the correct specification implies a misspecification, 
which will lead to inconsistent estimates. Because of the form of the conditional mean 
specification, the calculation of marginal effects is more complicated, with similarities 
to the two-part model used in Section 16.4. 
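
A Poisson hurdle density and its mean can be sketched as follows, with both $f_1$ and $f_2$ taken to be Poisson densities with means `mu1` and `mu2`. These names and the numerical values are hypothetical; a regression version would set each mean to an exponential function of covariates.

```python
import math

def poisson_pmf(y, mu):
    """Poisson density on the log scale."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def hurdle_pmf(y, mu1, mu2):
    """Hurdle density: f1 governs the zeros, zero-truncated f2 the positives."""
    f1_zero = math.exp(-mu1)
    f2_zero = math.exp(-mu2)
    if y == 0:
        return f1_zero
    return (1.0 - f1_zero) / (1.0 - f2_zero) * poisson_pmf(y, mu2)

def hurdle_mean(mu1, mu2):
    """Pr[y > 0] times the mean of the zero-truncated Poisson(mu2) density."""
    return (1.0 - math.exp(-mu1)) * mu2 / (1.0 - math.exp(-mu2))

mu1, mu2 = 0.6, 3.0   # many zeros, but fairly large counts once positive
total_prob = sum(hurdle_pmf(y, mu1, mu2) for y in range(80))
numeric_mean = sum(y * hurdle_pmf(y, mu1, mu2) for y in range(80))
print("Pr[y = 0]           ", round(hurdle_pmf(0, mu1, mu2), 6))
print("probabilities sum to", round(total_prob, 6))
print("mean (sum, formula) ", round(numeric_mean, 6),
      round(hurdle_mean(mu1, mu2), 6))
```

Setting `mu1` equal to `mu2` returns the ordinary Poisson density, in line with the remark that the hurdle model nests the standard model when the two processes coincide.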


With-Zeros or Zero-Inflated Model 


A second modified count model is the with-zeros model or zero-inflated model. This
supplements a count density $f_2(\cdot)$ with a binary process with density $f_1(\cdot)$.
If the binary process takes value 0, with probability $f_1(0)$, then $y = 0$. If the
binary process takes value 1, with probability $f_1(1)$, then $y$ takes count values
$0, 1, 2, \ldots$ from the count density $f_2(\cdot)$. This lets zero counts occur in two
ways: as a realization of the binary process and as a realization of the count process
when the binary random variable takes value 1. The density is

$$g(y) = \begin{cases}
  f_1(0) + (1 - f_1(0))\, f_2(0) & \text{if } y = 0, \\
  (1 - f_1(0))\, f_2(y) & \text{if } y \ge 1.
\end{cases} \tag{20.19}$$

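Regression models let $f_1(\cdot)$ be a logit model and $f_2(\cdot)$ be a Poisson or
negative binomial density. This model is used much less than the hurdle model. Because
$0 \le f_1(0) \le 1$, (20.19) implies $\Pr[y = 0] \ge f_2(0)$, so, unlike the hurdle
model, this model accommodates excess zeros but not too few zeros.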
The zero-inflated count model is used less frequently in econometrics than in other 
statistical disciplines. 
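
The zero-inflated density can be sketched in the same way as the hurdle density. Here $f_1(0)$ is a logit probability and $f_2$ is Poisson; the function names, the logit index value, and the Poisson mean are hypothetical.

```python
import math

def poisson_pmf(y, mu):
    """Poisson density on the log scale."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))

def logit_probability(index):
    """Logistic cdf, used here for the probability that the binary process is 0."""
    return 1.0 / (1.0 + math.exp(-index))

def zero_inflated_pmf(y, p_zero, mu):
    """Zero-inflated density with f1(0) = p_zero and f2 Poisson with mean mu."""
    if y == 0:
        return p_zero + (1.0 - p_zero) * poisson_pmf(0, mu)
    return (1.0 - p_zero) * poisson_pmf(y, mu)

p_zero = logit_probability(-0.8)   # hypothetical logit index
mu = 2.0
total_prob = sum(zero_inflated_pmf(y, p_zero, mu) for y in range(80))
print("Pr[y = 0], zero-inflated:", round(zero_inflated_pmf(0, p_zero, mu), 6))
print("Pr[y = 0], plain Poisson:", round(poisson_pmf(0, mu), 6))
print("probabilities sum to    ", round(total_prob, 6))
```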


20.4.6. Discrete Choice Models


Count data can be modeled by discrete choice model methods, possibly after some 
grouping of counts to limit the number of categories. For example, the categories may 
be 0, 1, 2, 3, and 4 or more if few observations exceed four. Unordered models such 
as multinomial logit, discussed in Section 15.4, are not parsimonious and more impor- 
tantly are inappropriate. Instead, a sequential model that recognizes the ordering of the 
data should be used. 

One such model is an ordered model. This defines an unobserved latent variable,
$y^{*} = \mathbf{x}'\beta + u$, with values of $y = 0, 1, 2, \ldots$ being observed as
$y^{*}$ crosses progressively higher thresholds, which are also parameters to be
estimated. An ordered logit (or probit) model arises when $u$ is logistic (or standard
normal) distributed. Ordered
models (see Section 15.9) are particularly useful when the count can also take nega- 
tive values as may occur when modeling a net change, such as the net change in the 
number of firms in an industry. 
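
The ordered logit probabilities implied by this latent-variable setup can be sketched as follows, for counts grouped into the categories 0, 1, 2, 3, and 4 or more. The function names, the index value, and the four thresholds are hypothetical.

```python
import math

def logistic_cdf(z):
    """Cdf of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-z))

def ordered_logit_probs(xb, thresholds):
    """Pr[y = j] for j = 0, ..., J given the index x'beta and increasing thresholds."""
    cdf = [logistic_cdf(t - xb) for t in thresholds]
    probs = [cdf[0]]                                   # y* below the first threshold
    probs += [cdf[j] - cdf[j - 1] for j in range(1, len(cdf))]
    probs.append(1.0 - cdf[-1])                        # y* above the last threshold
    return probs

thresholds = [-0.5, 0.4, 1.2, 2.0]   # four thresholds give five categories
probs = ordered_logit_probs(0.3, thresholds)
for label, p in zip(["0", "1", "2", "3", "4 or more"], probs):
    print(f"Pr[y = {label}] = {p:.4f}")
```

Raising the index `xb` shifts probability mass toward the higher categories, which is how regressors affect the count in this model.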

Another possible sequential model, although less parsimonious, is obtained by spec- 
ifying a sequence of binary models for Pr[y = 1|y > 0], Pr[y = 2|y > 1], and so on. 

Finally, in some cases durations may be available in addition to counts. For example, 
if the dates of doctor visits are known, one can model a count, the number of visits in 
a month, say, or the duration of time between visits. In general, the latter approach 
is more efficient, since it uses more detailed data, but the count regression can still 
provide useful information about the role of covariates (Dean and Balshaw, 1997). 


20.5. Partially Parametric Models 


By partially parametric models we mean that we focus on modeling the data via the 
conditional mean and variance, and even these may not be fully specified. In Sec- 
tion 20.5.1 we consider models based on specification of the conditional mean and 
variance. In Section 20.5.2 we consider and critique the use of least-squares methods 
that do not explicitly model the heteroskedasticity inherent in count data. In Section 
20.5.3 we consider models that are even more partially parametric, such as those giving 
an incomplete specification of the conditional mean. 

The approach is similar in flavor to NLS, except that here we allow for het- 
eroskedasticity that is well modeled as a function of the conditional mean. 


20.5.1. Quasi-ML Estimation 


As discussed in Section 20.2.1, when using PML or QML, the distribution of the es- 
timator is obtained under weaker assumptions about the dgp than those that lead to a 
specific likelihood function. 

Let us reconsider (20.6). Given an assumption for the functional form for $\omega_i$, and
a consistent estimate $\hat{\omega}_i$ of $\omega_i$, one can consistently estimate this
covariance matrix. We could use the Poisson assumption, $\omega_i = \mu_i$, but as already
noted the data are often overdispersed, with $\omega_i > \mu_i$. Common variance functions
used are $\omega_i = (1 + \alpha\mu_i)\mu_i$, that of the NB2 model discussed in
Section 20.4.1, and $\omega_i = (1 + \alpha)\mu_i$, that of the NB1 model. Note that in the
latter case (20.6) simplifies to
$V_{PML}[\hat{\beta}_P] = (1 + \alpha)\left(\sum_{i=1}^{N} \mu_i\mathbf{x}_i\mathbf{x}_i'\right)^{-1}$,
so with overdispersion ($\alpha > 0$) the usual ML variance matrix given in (20.7)
understates the true variance.

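If $\omega_i = E[(y_i - \exp(\mathbf{x}_i'\beta))^{2}|\mathbf{x}_i]$ is instead
unspecified, a consistent estimate of $V_{PML}[\hat{\beta}_P]$ can be obtained by adapting
the Eicker-White robust sandwich variance estimate formula to this case. The middle sum in
(20.6) needs to be estimated. If $\hat{\mu}_i \overset{p}{\to} \mu_i$ then
$N^{-1}\sum_i (y_i - \hat{\mu}_i)^{2}\mathbf{x}_i\mathbf{x}_i' \overset{p}{\to}
\lim N^{-1}\sum_i \omega_i\mathbf{x}_i\mathbf{x}_i'$. Thus a consistent estimate of
$V_{PML}[\hat{\beta}_P]$ is given by (20.6) with $\omega_i$ and $\mu_i$ replaced by
$(y_i - \hat{\mu}_i)^{2}$ and $\hat{\mu}_i$.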

When doubt exists about the form of the variance function, the use of the PML es- 
timator is recommended. Computationally this is essentially the same as Poisson ML, 
with the qualification that the variance matrix must be recomputed. The calculation of 
robust variances is often an option in standard packages. 
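
The recomputation of the variance matrix can be sketched with NumPy. The sketch assumes that (20.6) has the usual sandwich form $A^{-1}BA^{-1}$ with $A = \sum_i \mu_i\mathbf{x}_i\mathbf{x}_i'$ and $B = \sum_i \omega_i\mathbf{x}_i\mathbf{x}_i'$, which is consistent with the NB1 simplification given above; the function names, the Newton-Raphson fitting routine, and the simulated overdispersed data are all illustrative assumptions.

```python
import numpy as np

def poisson_pml(y, X, iterations=25):
    """Newton-Raphson for the Poisson pseudo-MLE with mean exp(x_i' beta)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iterations):
        mu = np.exp(X @ beta)
        score = X.T @ (y - mu)
        information = X.T @ (X * mu[:, None])
        beta = beta + np.linalg.solve(information, score)
    return beta

def pml_variances(y, X, beta):
    """Return the usual ML variance and the robust sandwich variance."""
    mu = np.exp(X @ beta)
    A = X.T @ (X * mu[:, None])                  # sum of mu_i x_i x_i'
    B = X.T @ (X * ((y - mu) ** 2)[:, None])     # sum of (y_i - mu_i)^2 x_i x_i'
    A_inv = np.linalg.inv(A)
    return A_inv, A_inv @ B @ A_inv

# Simulated overdispersed counts: Poisson with gamma heterogeneity (mean 1, variance 1).
rng = np.random.default_rng(0)
N = 2000
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
nu = rng.gamma(shape=1.0, scale=1.0, size=N)
y = rng.poisson(np.exp(0.5 + 0.3 * x) * nu)

beta_hat = poisson_pml(y, X)
v_ml, v_robust = pml_variances(y, X, beta_hat)
print("beta_hat :", beta_hat)
print("ML se    :", np.sqrt(np.diag(v_ml)))
print("robust se:", np.sqrt(np.diag(v_robust)))
```

With overdispersed data the robust standard errors should exceed the ML ones, so t-ratios based on the usual ML variance are inflated.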

These results for Poisson PML estimation are qualitatively similar to those for PML 
estimation in the linear model under normality. They extend more generally to PML 
estimation based on densities in the linear exponential family. In all cases consistency 
requires only correct specification of the conditional mean (Nelder and Wedderburn, 
1972; Gouriéroux et al., 1984a). This has led to a vast statistical literature on gener- 
alized linear models (see McCullagh and Nelder, 1989). These permit valid inference 
providing the conditional mean is correctly specified and nest many types of data as 
special cases — continuous (normal), count (Poisson), discrete (binomial), and positive 
(gamma) as detailed in Section 5.7.4. Many methods for complications, such as time- 
series and panel data models, are presented within the more general GLM framework 
rather than specifically for count data. 

Some econometricians find it more natural to use the GMM framework rather than
GLM. Then the starting point is the conditional moment
$E[y_i - \exp(\mathbf{x}_i'\beta)|\mathbf{x}_i] = 0$. If data are independent over $i$ and
the conditional variance is a multiple of the mean it can be shown that the optimal choice
of instrument is $\mathbf{x}_i$, leading to the estimating equations (20.5); for more
detail, see Cameron and Trivedi (1998, pp. 37-44). The GMM framework has been fruitful for
panel data on counts (see Section 20.5.3) and for endogenous regressors. Fully specified
parametric simultaneous equations models for counts are in their infancy, so instrumental
variables methods are appealing. Given instruments $\mathbf{z}_i$,
$\dim(\mathbf{z}) \ge \dim(\mathbf{x})$, satisfying
$E[y_i - \exp(\mathbf{x}_i'\beta)|\mathbf{z}_i] = 0$, a consistent estimator of $\beta$
minimizes

$$Q(\beta) = \left[\sum_{i=1}^{N} (y_i - \exp(\mathbf{x}_i'\beta))\mathbf{z}_i\right]'
  \mathbf{W}
  \left[\sum_{i=1}^{N} (y_i - \exp(\mathbf{x}_i'\beta))\mathbf{z}_i\right], \tag{20.20}$$

where $\mathbf{W}$ is a symmetric weighting matrix.
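
The objective function can be written in a few lines of NumPy. In this sketch the function name, the simulated regressor and instruments, and the choice $\mathbf{W} = (N^{-1}\sum_i \mathbf{z}_i\mathbf{z}_i')^{-1}$ are assumptions, and the outcome is set equal to its conditional mean so that the objective is exactly zero at the true parameter; real count data would of course be noisy integers.

```python
import numpy as np

def gmm_objective(beta, y, X, Z, W):
    """Quadratic form in the sample moments sum_i (y_i - exp(x_i' beta)) z_i."""
    residual = y - np.exp(X @ beta)
    moments = Z.T @ residual
    return float(moments @ W @ moments)

rng = np.random.default_rng(1)
N = 500
z1, z2 = rng.normal(size=N), rng.normal(size=N)
x = 0.8 * z1 + 0.2 * rng.normal(size=N)      # regressor correlated with z1
X = np.column_stack([np.ones(N), x])
Z = np.column_stack([np.ones(N), z1, z2])    # dim(z) = 3 >= dim(x) = 2
beta_true = np.array([0.2, 0.5])
y = np.exp(X @ beta_true)                    # noiseless outcome, for illustration only
W = np.linalg.inv(Z.T @ Z / N)

print("Q at beta_true      :", gmm_objective(beta_true, y, X, Z, W))
print("Q at a perturbed beta:",
      gmm_objective(beta_true + np.array([0.1, 0.0]), y, X, Z, W))
```

In an application this function would be passed to a numerical minimizer, and the weighting matrix would typically be updated in a second step.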

The pros and cons of this approach are as follows. A major advantage is that the 
approach makes fewer distributional assumptions and hence avoids a possible model 
misspecification. However, the discreteness in the outcome variable and its natural
heteroskedasticity are ignored, leading to a possible loss of efficiency. A suitable
choice of the W matrix may mitigate the problem. Further, by emphasizing the first moment
of the distribution, when potentially there may be significant additional information in
the higher moments, the IV estimator may be sensitive to the presence of large counts
in the data. Table 20.2 illustrates features of some types of data that are awkward to 
model using a GMM-type estimator. 


20.5.2. Least-Squares Estimation 


When attention is focused on modeling just the conditional mean, least-squares meth- 
ods are inferior to the approach of the previous section. 

Linear least-squares regression of $y$ on $\mathbf{x}$ leads to consistent parameter
estimates if the conditional mean is linear in $\mathbf{x}$. However, for count data the
specification $E[y|\mathbf{x}] = \mathbf{x}'\beta$ is inadequate as it permits negative
values of $E[y|\mathbf{x}]$. For similar reasons the linear probability model is
inadequate for binary data.

Transformations of $y$ may be considered. In particular, the logarithmic transformation
regresses $\ln y$ on $\mathbf{x}$. This transformation is problematic if the data contain
zeros, as is often the case. One standard solution is to add a constant term, such as 0.5,
and to model $\ln(y + 0.5)$ by OLS. This ad hoc method introduces problems of
retransformation if we are interested in $E[y|\mathbf{x}]$ rather than
$E[\ln y|\mathbf{x}]$; see Mullahy (1998). However, conversion to a linear model has the
advantage of convenience if, for example, there is an endogenous right-hand variable that
needs to be “instrumented.”

It is instead better to use nonlinear least squares with the exponential mean
specification; that is, estimate the nonlinear regression model
$y = \exp(\mathbf{x}'\beta) + u$. It is important that statistical inference for the NLS
estimator be based on Eicker-White robust standard errors since the error term in this
regression will be heteroskedastic.

For counts the NLS estimator is generally less efficient than the Poisson pseudo-MLE.
The NLS first-order condition is
$\sum_i (y_i - \exp(x_i'\beta)) \exp(x_i'\beta) x_i = 0$. This weights the residuals
differently than does the Poisson pseudo-MLE (see (20.5)). The NLS weights are optimal
if $V[y_i|x_i]$ is constant (homoskedastic) whereas the Poisson pseudo-MLE weights are
optimal if $V[y_i|x_i]$ is a multiple of $E[y_i|x_i]$. The latter is a much better
model for handling the inherent heteroskedasticity of count data.
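
The two sets of estimating equations differ only in how the residual is weighted, which
the following sketch makes explicit. It assumes NumPy; the function name
`fit_exp_mean`, the simulated design, and the parameter values are hypothetical choices
made for illustration. Both estimators solve $\sum_i w_i (y_i - \mu_i) x_i = 0$, with
$w_i = 1$ for the Poisson pseudo-MLE and $w_i = \mu_i$ for NLS.

```python
import numpy as np

def fit_exp_mean(X, y, start, nls=False, iters=100, tol=1e-10):
    """Solve sum_i w_i (y_i - mu_i) x_i = 0 with mu_i = exp(x_i'b).

    w_i = 1    gives the Poisson pseudo-MLE (Newton-Raphson steps);
    w_i = mu_i gives NLS (Gauss-Newton steps).
    """
    b = np.array(start, dtype=float)
    for _ in range(iters):
        mu = np.exp(X @ b)
        w = mu if nls else np.ones_like(mu)
        score = X.T @ (w * (y - mu))              # weighted residual moments
        A = (X * (w * mu)[:, None]).T @ X         # sum_i w_i mu_i x_i x_i'
        step = np.linalg.solve(A, score)
        b = b + step
        if np.max(np.abs(step)) < tol:
            break
    return b

rng = np.random.default_rng(0)
n = 5_000
x = rng.uniform(-1.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([0.5, 0.5])                  # illustrative values
y = rng.poisson(np.exp(X @ beta_true))

b_pois = fit_exp_mean(X, y, start=[np.log(y.mean()), 0.0])
b_nls = fit_exp_mean(X, y, start=b_pois, nls=True)   # start NLS at the Poisson fit
print("Poisson pseudo-MLE:", b_pois)
print("NLS:               ", b_nls)
```

Both estimators are consistent here because the conditional mean is correctly
specified; repeating the simulation many times would be needed to compare their
sampling variability.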


20.5.3. Semiparametric Models 


By semiparametric models we mean partially parametric models that have an
infinite-dimensional component, as developed in Section 9.7. The curse of
dimensionality motivates us to put some structure on the conditional mean function.

One class of semiparametric models incompletely specifies the conditional mean.
Leading examples are single-index models and partially linear models. Single-index
models specify $\mu_i = g(x_i'\beta)$, where the functional form $g(\cdot)$ is left
unspecified. Partially linear models specify $\mu_i = \exp(x_i'\beta + g(z_i))$, where
the functional form $g(\cdot)$ is left unspecified. In both cases $\sqrt{N}$-consistent
asymptotically normal estimators of $\beta$ can be obtained, without knowledge of
$g(\cdot)$.

A second example is optimal estimation of the regression parameters $\beta$, when
$\mu_i = \exp(x_i'\beta)$ is assumed but $V[y_i|x_i] = \omega_i$ is left unspecified.
The infinite-dimensional component arises because as $N \to \infty$ there are
infinitely many variance parameters $\omega_i$. An optimal estimator of $\beta$, called
an adaptive estimator, is one that is as efficient as the estimator that could be used
if $\omega_i$ were known. Delgado and Kniesner (1997) extend results for the linear
regression model to count data with exponential conditional mean function, using kernel
regression methods to estimate weights to be used in a second-stage nonlinear
least-squares regression. In their application the estimator shows little gain over
specifying $\omega_i = \mu_i(1 + \alpha\mu_i)$, overdispersion of the NB2 form.


20.6. Multivariate Counts and Endogenous Regressors 


In this section we very briefly present extensions from cross-section to other types of
count data (see Cameron and Trivedi, 1998, for further detail). For multivariate count 
data many models have been proposed but preferred methods have not yet been estab- 
lished. For related panel data there is more agreement in the econometrics literature on 
which methods to use, though a wider range of models is considered in the statistics 
literature; see Section 23.7. 


20.6.1. Multivariate Data 


In some data sets more than one count is observed. For example, data on the utiliza- 
tion of several different types of health service, such as doctor visits and hospital days, 
may be available. Joint modeling will improve efficiency and provide richer models 
of the data if counts are correlated. This section briefly reviews bivariate count mod- 
els related to the main models of this chapter. The reader familiar with multiequation 
linear models with correlated errors, e.g. the SUR model in Section 6.9.3, may think 
of a generalization to multiequation count models with correlated errors. Assume that 
we observe several count variables for the same individual (e.g., number of visits to 
a doctor and number of prescribed medications taken). The source of correlation may 
lie in unobserved heterogeneity. Joint estimation that takes account of correlated er- 
rors will yield more efficient estimates, but at the cost of additional computational 
complexity. 


Semiparametric Methods 


A partially parametric approach views this as a seemingly unrelated regressions prob- 
lem, adapting methods for the linear regression model to count data where the condi- 
tional means are nonlinear and the data are heteroskedastic; see Section 6.10.3. 

Gouriéroux, Monfort, and Trognon (1984b) propose a moment-based approach to
derive the bivariate Poisson-type model. They specify a model by defining the first two
moments of $y_1$ and $y_2$ and estimate it by a quasi-generalized pseudo-maximum
likelihood procedure. This model allows for overdispersion and is more general than the
bivariate Poisson model, but it does not maintain the integer-valued property of the
counts.

Delgado (1992) treats a multivariate count model as a multivariate nonlinear model 
and suggests a semiparametric generalized least-squares estimator. The covariance ma- 
trix of the residuals is estimated using the K-NN method. The approach differs from 
that of Gouriéroux, Monfort, and Trognon (1984) in the choice of the estimator for the
covariance matrix. 

Most parametric studies have used the bivariate Poisson. One way this distribution
is derived is to suppose that the two counts $y_1$ and $y_2$ are generated as
$y_1 = z_1 + w$ and $y_2 = z_2 + w$, where all of $z_1$, $z_2$, and $w$ are independent
and Poisson distributed, with positive parameters $\lambda_1$, $\lambda_2$, and
$\lambda_{12}$, respectively, which may be parameterized as functions of exogenous
covariates. This is called the method of trivariate reduction.

The marginal distribution of $y_j$ is Poisson$[\lambda_j + \lambda_{12}]$ and,
therefore, this model restricts the conditional mean to be equal to the conditional
variance for each count variable, so

\[
E[y_j|x_j] = V[y_j|x_j] \tag{20.21}
\]

for $j = 1, 2$, where $x_j$ is a vector of explanatory variables. The correlation
coefficient is given by

\[
\mathrm{Cor}[y_1, y_2] =
\frac{\lambda_{12}}{\sqrt{(\lambda_1 + \lambda_{12})(\lambda_2 + \lambda_{12})}},
\tag{20.22}
\]

which is positive, because $\lambda_{12} > 0$.
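
The trivariate reduction construction is easy to simulate, which gives a quick check of
the equidispersion restriction (20.21) and the correlation (20.22). The following is a
minimal sketch assuming NumPy; the parameter values are illustrative assumptions and no
covariates are included.

```python
import numpy as np

rng = np.random.default_rng(1)
lam1, lam2, lam12 = 1.0, 2.0, 0.5      # illustrative parameter values
n = 200_000

z1 = rng.poisson(lam1, n)
z2 = rng.poisson(lam2, n)
w = rng.poisson(lam12, n)              # common component shared by both counts
y1, y2 = z1 + w, z2 + w

corr_theory = lam12 / np.sqrt((lam1 + lam12) * (lam2 + lam12))
corr_sample = np.corrcoef(y1, y2)[0, 1]

print(f"mean, variance of y1: {y1.mean():.3f}, {y1.var():.3f}")
print(f"correlation: theory {corr_theory:.3f}, sample {corr_sample:.3f}")
```

Since the shared component enters both counts additively, this construction cannot
generate negatively correlated counts.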


Fully Parametric Methods 


Several recent studies develop better parametric models by introducing correlated 
unobserved heterogeneity for each count. The related issues were discussed in Sec- 
tions 6.10.1 and 19.3. 

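Marshall and Olkin (1990) consider a model with multiplicative unobserved
heterogeneity in the marginal distributions of both counts in the following way. Let
$y_j$ be $P[\lambda_j \nu]$, $j = 1, 2$, where $P$ denotes Poisson distribution with
mean $\lambda_j \nu$ and $\nu$ has gamma distribution with density

\[
g(\nu) = \frac{\nu^{\alpha - 1} \exp(-\nu)}{\Gamma(\alpha)}.
\]

The random variable $\nu$ can be interpreted as common (shared) unobserved
heterogeneity. The resulting model is a one-factor model. The bivariate negative
binomial (BVNB) distribution of two counts is defined as

\begin{align}
f(y_1, y_2|x_1, x_2)
 &= \int_0^\infty f_1(y_1|x_1, \nu) f_2(y_2|x_2, \nu) g(\nu)\, d\nu \tag{20.23} \\
 &= \int_0^\infty \left[ \prod_{j=1}^{2}
    \frac{\exp(-\lambda_j \nu)(\lambda_j \nu)^{y_j}}{y_j!} \right]
    \frac{\nu^{\alpha - 1} \exp(-\nu)}{\Gamma(\alpha)}\, d\nu \notag \\
 &= \frac{\Gamma(y_1 + y_2 + \alpha)}{y_1!\, y_2!\, \Gamma(\alpha)}
    \left[ \frac{\lambda_1}{\lambda_1 + \lambda_2 + 1} \right]^{y_1}
    \left[ \frac{\lambda_2}{\lambda_1 + \lambda_2 + 1} \right]^{y_2}
    \left[ \frac{1}{\lambda_1 + \lambda_2 + 1} \right]^{\alpha}. \notag
\end{align}
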
This mixture has a closed-form solution, but the model restricts the unobserved
heterogeneity to be the identical component for both count variables. The joint
likelihood is built up with terms like (20.23). The marginal distributions are
univariate negative binomial and the correlation between the two count variables,

\[
\mathrm{Cor}[y_1, y_2] =
\frac{\lambda_1 \lambda_2}{\sqrt{(\lambda_1^2 + \lambda_1)(\lambda_2^2 + \lambda_2)}},
\tag{20.24}
\]

must be positive.
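
The closed form in (20.23) can be checked numerically. The sketch below uses only the
Python standard library; the function name `bvnb_pmf`, the parameter values, and the
truncation point of the sums are illustrative assumptions. It verifies that the
probabilities sum to one and compares the correlation computed from the probability
mass function with the value implied by a gamma mixing density with shape $\alpha$ and
unit scale, for which $E[\nu] = V[\nu] = \alpha$.

```python
from math import exp, lgamma, log, sqrt

def bvnb_pmf(y1, y2, lam1, lam2, alpha):
    """Closed-form mixture of two Poisson counts sharing a gamma(alpha, 1) factor."""
    denom = lam1 + lam2 + 1.0
    log_p = (lgamma(y1 + y2 + alpha) - lgamma(y1 + 1) - lgamma(y2 + 1)
             - lgamma(alpha)
             + y1 * log(lam1 / denom) + y2 * log(lam2 / denom)
             - alpha * log(denom))
    return exp(log_p)

lam1, lam2, alpha = 0.8, 1.5, 2.0      # illustrative values
total = e1 = e2 = e11 = e22 = e12 = 0.0
for a in range(200):                   # sums truncated at 199 in each dimension
    for b in range(200):
        p = bvnb_pmf(a, b, lam1, lam2, alpha)
        total += p
        e1 += a * p
        e2 += b * p
        e11 += a * a * p
        e22 += b * b * p
        e12 += a * b * p

cor_numeric = (e12 - e1 * e2) / sqrt((e11 - e1 ** 2) * (e22 - e2 ** 2))
# Correlation implied by E[v] = V[v] = alpha for the shared gamma factor
cor_implied = lam1 * lam2 / sqrt((lam1 ** 2 + lam1) * (lam2 ** 2 + lam2))

print(f"sum of probabilities: {total:.10f}")
print(f"E[y1]: {e1:.6f} (alpha * lam1 = {alpha * lam1:.6f})")
print(f"correlation: numeric {cor_numeric:.6f}, implied {cor_implied:.6f}")
```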

Other models with more flexible correlation structures, but that also require 
computationally advanced methods, have been proposed by Cameron and Johansson 
(1998), Munkin and Trivedi (1999), and Chib and Winkelmann (2001). 

Munkin and Trivedi (1999) consider a generalization of the BVNB model as 
follows: 


\[
f(y_1, y_2|x_1, x_2) = \int_0^\infty \!\! \int_0^\infty
f_1(y_1|x_1, \nu_1) f_2(y_2|x_2, \nu_2) g(\nu_1, \nu_2)\, d\nu_1\, d\nu_2,
\tag{20.25}
\]


where the joint distribution is built up from the two marginal models, each
conditioned on a separate unobserved heterogeneity variable, $\nu_1$ and $\nu_2$,
respectively, that are specified to have a bivariate normal distribution. Conditional
on $(x_1, x_2, \nu_1, \nu_2)$ each marginal distribution is Poisson, with
multiplicative unobserved normal heterogeneity. The model is therefore a bivariate
Poisson-log-normal mixture. The likelihood function is the product over the sample of
terms like (20.25). The authors interpret this as a “two-factor model.” This
specification is more flexible as it does not restrict the sign or size of correlation
between the two unobserved components. However, this additional flexibility introduces
computational complexity because the bivariate integral in (20.25) does not have an
analytical solution and hence must be handled using a simulation-based approach
(discussed in Chapter 12 and in Munkin and Trivedi, 1999). If the dimension of the
model, the number of y variables, increases, then so does the order of numerical
integration involved. This feature combined with a possibly large sample size can make
the computational burden very significant. Chib and Winkelmann (2001) suggest an
alternative Bayesian MCMC approach, which, while retaining the flexibility of the
aforementioned specification, can handle a high-dimensional outcome vector. They
demonstrate the feasibility of their approach with a six-dimensional mixed
Poisson-log-normal model.
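
To illustrate what a simulation-based approach to an integral like (20.25) involves,
the following sketch approximates one joint probability by averaging over draws of the
heterogeneity terms. It uses only the Python standard library. The parameterization is
an assumption made for illustration: each count is Poisson with mean
$\lambda_j \exp(u_j)$, where $(u_1, u_2)$ is bivariate normal with zero means, standard
deviations `sigma1` and `sigma2`, and correlation `rho`. The function names are
hypothetical.

```python
import random
from math import exp, lgamma, log, sqrt

def poisson_pmf(y, mu):
    return exp(-mu + y * log(mu) - lgamma(y + 1))

def draw_heterogeneity(S, sigma1, sigma2, rho, seed=0):
    """S draws of (u1, u2) from a zero-mean bivariate normal distribution."""
    rnd = random.Random(seed)
    draws = []
    for _ in range(S):
        e1, e2 = rnd.gauss(0.0, 1.0), rnd.gauss(0.0, 1.0)
        draws.append((sigma1 * e1,
                      sigma2 * (rho * e1 + sqrt(1.0 - rho ** 2) * e2)))
    return draws

def simulated_prob(y1, y2, lam1, lam2, draws):
    """Average of f1 * f2 over the draws: a simulated version of the integral."""
    total = 0.0
    for u1, u2 in draws:
        total += poisson_pmf(y1, lam1 * exp(u1)) * poisson_pmf(y2, lam2 * exp(u2))
    return total / len(draws)

# A negative correlation between the two heterogeneity terms is permitted here.
draws = draw_heterogeneity(S=200, sigma1=0.5, sigma2=0.5, rho=-0.5)
print(simulated_prob(1, 2, lam1=1.0, lam2=1.5, draws=draws))
```

A simulated log-likelihood would sum the logarithm of such terms over observations,
reusing the same draws at every parameter value tried by the optimizer.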

Another recently developed approach to modeling correlated counts is the cop- 
ula approach described in Section 19.3. Here one begins with the specification of 
marginal distributions; the joint distribution is obtained by combining the marginals 
using a copula. Examples for dependent durations were given in Section 19.3. See also 
Cameron, Li, Trivedi, and Zimmer (2004). 


20.6.2. Count Models with Endogenous Regressors 


Simultaneous models for count variables arise in a number of contexts. For example, 
in Cameron et al. (1988) the focus is on a count variable (medical utilization), but one 
of the covariates, the health insurance status of the subject, is an endogenous choice. 
Mullahy (1997) in a cross-section context, and Crépon and Duguet (1997b) in a panel 
data context, apply the GMM approach to count models with endogenous regressors. A
very well known example from health economics involves models of counts of health
services, such as doctor visits, in which one of the regressors is the health insurance
status of the individual. The assumption that the choice of health insurance and the
error on the outcome equation are uncorrelated is unrealistic, and hence the insurance
regressor is likely to be endogenous. Chapter 22 provides more examples and details
of panel count models with endogenous regressors.

Currently the econometric literature provides two approaches to the estimation of 
models with endogenous regressors: one based on the GMM/IV approach and the other 
based on stronger assumptions of maximum likelihood. We consider each in turn. 

The first approach (Mullahy, 1997) begins with a moment condition. Consider the
exponential mean model with additive zero-mean error term,

\begin{align}
y_i &= E[y_i|x_i] + \nu_i = \exp(x_i'\beta) + \nu_i, \tag{20.26} \\
E[\nu_i|x_i] &\neq 0. \tag{20.27}
\end{align}

Suppose that we have available instrumental variables $z_i$ that satisfy the moment
conditions

\begin{align}
E[\nu_i|z_i] &= 0, \tag{20.28} \\
E[y_i - \exp(x_i'\beta)|z_i] &= 0. \notag
\end{align}



Then the GMM or nonlinear IV estimation is feasible, assuming that there are enough
moment conditions available. This approach has already been discussed in
Section 6.5.3. The reader is referred to this section for details and related
discussion. However, note that in implementing this approach the count nature of the
variable is ignored and the model is treated like any other nonlinear model with an
exponential mean. Also, note that heteroskedasticity is highly likely with count data
and hence the GMM/IV procedure should accommodate this complication.

Mullahy has pointed out that a multiplicative error term specification has certain
advantages. This, however, leads to a different moment condition. Let

\[
E[y_i|x_i, \nu_i] = \exp(x_i'\beta)\nu_i. \tag{20.29}
\]

This leads to the moment condition

\[
E\left[ \left. \frac{y_i}{\exp(x_i'\beta)} - 1 \right| z_i \right] = 0, \tag{20.30}
\]

which is a special case of the nonlinear moment condition
$E[r(y_i, x_i, \beta)|z_i] = 0$ discussed in Section 6.5. Provided suitable and
sufficient moment conditions are available, the GMM approach can be followed. Once
again, however, for a count variable, heteroskedasticity is likely and efficiency loss
will occur because the count feature of the variable has been ignored.
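
A minimal sketch of estimation based on the multiplicative moment condition (20.30) in
the just-identified case follows, assuming NumPy. The data-generating process, the
parameter values, and the function names are hypothetical choices made for
illustration: the regressor is correlated with the multiplicative error through a
shared first-stage disturbance, whereas the instrument is independent of it. The sample
moment equations are solved by Newton iterations with step halving, and the Poisson
pseudo-MLE that ignores the endogeneity is shown for comparison.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
beta_true = np.array([0.5, 0.5])            # illustrative values

z1 = rng.normal(size=n)                     # instrument, independent of the errors
e = rng.normal(size=n)                      # first-stage error
x1 = 0.8 * z1 + e                           # endogenous regressor
v = np.exp(0.5 * e - 0.125)                 # multiplicative error with E[v] = 1
X = np.column_stack([np.ones(n), x1])
Z = np.column_stack([np.ones(n), z1])
y = rng.poisson(np.exp(X @ beta_true) * v)

def poisson_pmle(X, y, iters=50):
    """Poisson pseudo-MLE by Newton-Raphson; ignores the endogeneity."""
    b = np.array([np.log(y.mean()), 0.0])
    for _ in range(iters):
        mu = np.exp(X @ b)
        b = b + np.linalg.solve((X * mu[:, None]).T @ X, X.T @ (y - mu))
    return b

def moments(b):
    """Sample analogue of E[z (y / exp(x'b) - 1)] = 0."""
    return Z.T @ (y * np.exp(-X @ b) - 1.0) / n

def solve_moments(b, iters=100, tol=1e-10):
    """Newton iterations with step halving for the just-identified case."""
    for _ in range(iters):
        g = moments(b)
        if np.max(np.abs(g)) < tol:
            break
        w = y * np.exp(-X @ b)
        J = -(Z * w[:, None]).T @ X / n     # Jacobian of the sample moments
        step = np.linalg.solve(J, -g)
        t = 1.0
        # halve the step until the moment norm decreases (also guards against NaN)
        while t > 1e-8 and not (
            np.linalg.norm(moments(b + t * step)) < np.linalg.norm(g)
        ):
            t /= 2.0
        b = b + t * step
    return b

b_pois = poisson_pmle(X, y)
b_gmm = solve_moments(b_pois.copy())
print("true:              ", beta_true)
print("Poisson pseudo-MLE:", b_pois)
print("moment-based IV:   ", b_gmm)
```

With more instruments than parameters a weighting matrix would be needed, and robust
standard errors would follow from the usual GMM sandwich formula; neither is shown
here.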

Alternative approaches that simultaneously handle the count feature of the dependent
variable and the problem of endogenous regressors are more parametric (Terza,
1998). Deb and Trivedi (2004) develop a joint model of counts (Y) with an insurance
plan variable (D) as a regressor and a binary choice model for the insurance plan.
Endogeneity in their model arises from the presence of correlated unobserved
heterogeneity in the outcome (count) equation and the binary choice equation. Their
model has the following structure:

\begin{align}
\Pr[Y_i = y_i|x_i, D_i, l_i] &= f(x_i'\beta + \gamma_1 D_i + \lambda l_i), \tag{20.31} \\
\Pr[D_i = 1|z_i, l_i] &= g(z_i'\alpha + \delta l_i), \tag{20.32}
\end{align}

where $l_i$ are latent factors reflecting unobserved heterogeneity and $\delta$ and
$\lambda$ are the associated factor loadings. The joint distribution of selection and
outcome variables, conditional on the common latent factors, can be written as

\[
\Pr[Y_i = y_i, D_i = 1|x_i, z_i, l_i]
= f(x_i'\beta + \gamma_1 D_i + \lambda l_i)\, g(z_i'\alpha + \delta l_i), \tag{20.33}
\]

because (Y, D) are assumed to be conditionally independent.

The problem in estimation arises because the $l_i$ are unknown. Although the $l_i$ are
unknown, assume that $h$, the distribution of $l_i$, is known, so that $l_i$ can be
integrated out of the joint density, that is,

\[
\Pr[Y_i = y_i, D_i = 1|x_i, z_i]
= \int \left[ f(x_i'\beta + \gamma_1 D_i + \lambda l_i)\,
  g(z_i'\alpha + \delta l_i) \right] h(l_i)\, dl_i. \tag{20.34}
\]

Cast in this form, the unknown parameters of the model may be estimated by maximum
likelihood.

For simplicity we assume $h(l_i)$ has no unknown parameters. Then the maximum
likelihood estimator maximizes the joint likelihood function
$L(\theta_1, \theta_2|y_i, D_i, x_i, z_i)$, where $\theta_1 = (\beta, \gamma_1, \lambda)$
and $\theta_2 = (\alpha, \delta)$ refer to parameters in the outcome and plan choice
equations, respectively, and $L$ refers to the joint likelihood whose ith component is
defined in (20.34). For identification additional normalization restrictions may
be needed.

The main practical problem of estimation given suitable specifications for $f$, $g$,
and $h$ is that the integral does not have, in general, a closed-form solution. The MSL
estimator involves replacing the expectation by a simulated sample analogue (average),
that is,

\[
\widetilde{\Pr}[Y_i = y_i, D_i = 1|x_i, z_i]
= \frac{1}{S} \sum_{s=1}^{S} \left[ f(x_i'\beta + \gamma_1 D_i + \lambda \tilde{l}_{is})\,
  g(z_i'\alpha + \delta \tilde{l}_{is}) \right], \tag{20.35}
\]

where $\tilde{l}_{is}$ is the sth draw (from a total of S draws) of a pseudo-random
number from the density $h$ and $\widetilde{\Pr}$ denotes the simulated probability. A
simulated likelihood function for the data can then be defined. The MSL estimator
maximizes the simulated log-likelihood.
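
The simulated log-likelihood built from terms like (20.35) can be sketched as follows,
assuming NumPy. The specific functional forms are assumptions made for illustration and
are not necessarily those used by Deb and Trivedi (2004): $f$ is taken to be Poisson
with an exponential mean, $g$ is logit, and $h$ is standard normal. For observations
with $D_i = 0$ the factor $g$ is replaced by $1 - g$. The function name and the
simulated data are hypothetical.

```python
import numpy as np
from math import lgamma

def simulated_loglik(theta, y, D, X, Z, draws):
    """Simulated log-likelihood; theta = (beta, gamma1, lam, alpha, delta).

    draws has one row per observation and one column per simulation draw.
    """
    kx, kz = X.shape[1], Z.shape[1]
    beta, gamma1, lam = theta[:kx], theta[kx], theta[kx + 1]
    alpha, delta = theta[kx + 2:kx + 2 + kz], theta[kx + 2 + kz]
    log_fact = np.array([lgamma(v + 1.0) for v in y])
    # Poisson outcome probability f at each draw of the latent factor
    log_mu = (X @ beta + gamma1 * D)[:, None] + lam * draws
    f = np.exp(-np.exp(log_mu) + y[:, None] * log_mu - log_fact[:, None])
    # logit plan-choice probability g at each draw
    p1 = 1.0 / (1.0 + np.exp(-((Z @ alpha)[:, None] + delta * draws)))
    g = np.where(D[:, None] == 1, p1, 1.0 - p1)
    # average over draws, then sum the logs over observations
    return np.sum(np.log(np.mean(f * g, axis=1)))

rng = np.random.default_rng(5)
n, S = 500, 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = np.column_stack([np.ones(n), rng.normal(size=n)])   # z variable excluded from x
l = rng.normal(size=n)                                   # latent factor
p_plan = 1.0 / (1.0 + np.exp(-(Z @ np.array([0.0, 1.0]) + 0.5 * l)))
D = (rng.uniform(size=n) < p_plan).astype(int)
y = rng.poisson(np.exp(X @ np.array([0.2, 0.3]) - 0.4 * D + 0.5 * l))

draws = rng.normal(size=(n, S))                          # held fixed across evaluations
theta = np.array([0.2, 0.3, -0.4, 0.5, 0.0, 1.0, 0.5])
print(simulated_loglik(theta, y, D, X, Z, draws))
```

In practice this function would be passed, with its sign reversed, to a numerical
optimizer, keeping the draws fixed so that the objective is a smooth function of the
parameters.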

This approach, developed for an endogenous dummy regressor in a count regres- 
sion model, can be extended to multiple dummies, and multiple outcomes, whether 
discrete or continuous. The limitation comes from the burden of estimation, which is 
very heavy compared with an IV-type estimator. Further, as in any simultaneous equa- 
tion model, identifiability is an issue. Applied work typically includes some nontrivial 
explanatory variables in the z vector that are excluded from the x vector. 


20.7. Count Example: Further Analysis


We now reconsider the earlier analysis based on the Poisson regression by using more 
flexible parametric models beginning with the NB2 model. 

The results for the NB2 model are given in the last columns of Table 20.5, presented 
in Section 20.3. Here too we report the robust standard errors and t-ratios. First note 
that the overdispersion coefficient $\alpha$ is highly significant. The Wald test statistic is
8.926, leading to a decisive rejection of the null of equidispersion ($\alpha = 0$). Consistent
with this is the large increase in the log-likelihood, from —60,087 to —42,777. Clearly, 
the improvement in the fit of the model is considerable. Because the models are nested 
it is unnecessary to report AIC and BIC. 
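
As a worked illustration using the two log-likelihood values just quoted, the
likelihood ratio statistic for the single restriction $\alpha = 0$ is

\[
\mathrm{LR} = 2\left[\ln L_{\mathrm{NB2}} - \ln L_{\mathrm{Poisson}}\right]
= 2\left[-42{,}777 - (-60{,}087)\right] = 34{,}620.
\]

This far exceeds the 5% critical value of 3.84 for a chi-square distribution with one
degree of freedom. Because $\alpha = 0$ lies on the boundary of the parameter space,
the appropriate critical value is in fact smaller, so the conclusion is unaffected.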

Row 3 in Table 20.6 shows the predicted frequencies from the NB2 model. These 
are very close to the observed frequencies and confirm the improvement in the fit of 
the model as a result of overdispersion being accounted for. 

The coefficients themselves, however, seem fairly stable among alternative estima- 
tion methods, and all effects are measured with precision, reflecting the impact of 
the large sample. These features of the results are encouraging, suggesting that the 
NB2 model is reasonable. As predicted by basic economic theory, utilization and the 
coinsurance rate (LC) are negatively correlated. The estimated impact does not seem 
sensitive to the treatment of overdispersion. 

Additional modeling refinements are possible. For example, Deb and Trivedi (2002) 
compare the performance of the two-part (hurdle) model with a two-component finite 
mixture model and find the latter to fit better. However, even the hurdle model fits better 
than the NB2 model. Although such refinements provide additional information, none 
of the results given here can be regarded as misleading on the essential question of 
price sensitivity of utilization. 

The NB2 model works well for doctor visits. For other count outcomes, however, 
even more flexible models than NB2 may be necessary. 


20.8. Practical Considerations 


Those with experience of nonlinear least squares will find it easy to use packaged 
software for Poisson regression, which is a widely available option in popular econo- 
metrics and statistics packages. Care is needed to ensure that robust standard errors 
are obtained. Many econometrics packages also include negative binomial regression 
and the basic panel data models. Popular statistics packages include count regression
in a generalized linear models module. Standard packages also produce some
goodness-of-fit statistics, such as the pseudo-$R^2$ measures, for the Poisson model;
see Section 8.7.1.

More recently developed models, such as finite mixture models, most time-series 
models, and dynamic panel data models, require developing one’s own programs. A 
promising route is to use matrix programming languages in conjunction with soft- 
ware for implementing estimation based on user-defined objective functions. For 
simple models many computer programs make it possible to implement maximum 
likelihood estimation and (highly desirable) robust variance estimation for
user-defined functions.

In addition to reporting parameter estimates it is useful to have an indication of 
the magnitude of the estimated effects, as discussed in Section 20.2.3. As noted in 
Section 20.2.4, care should be taken to ensure that reported standard errors and t- 
statistics for the Poisson regression model are based on variance estimates robust to 
overdispersion. 
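
The difference between default and robust standard errors for Poisson regression can be
illustrated as follows. This is a minimal sketch assuming NumPy; the simulated NB2-type
data, the parameter values, and the function name `poisson_fit` are illustrative
assumptions. The default variance estimate is the inverse of $\sum_i \hat\mu_i x_i x_i'$,
and the robust estimate is the sandwich form that places
$\sum_i (y_i - \hat\mu_i)^2 x_i x_i'$ in the middle.

```python
import numpy as np

def poisson_fit(X, y, iters=50, tol=1e-10):
    """Poisson regression by Newton-Raphson, starting at the constant-only fit."""
    b = np.zeros(X.shape[1])
    b[0] = np.log(y.mean())
    for _ in range(iters):
        mu = np.exp(X @ b)
        step = np.linalg.solve((X * mu[:, None]).T @ X, X.T @ (y - mu))
        b = b + step
        if np.max(np.abs(step)) < tol:
            break
    return b

rng = np.random.default_rng(3)
n = 5_000
x = rng.uniform(-1.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])
mu_true = np.exp(0.5 + 0.5 * x)                  # illustrative parameter values
alpha = 1.0
# Overdispersed counts: Poisson mixed with a gamma term of mean 1 and variance alpha
y = rng.poisson(mu_true * rng.gamma(shape=1.0 / alpha, scale=alpha, size=n))

b = poisson_fit(X, y)
mu = np.exp(X @ b)
A = (X * mu[:, None]).T @ X                      # sum_i mu_i x_i x_i'
B = (X * ((y - mu) ** 2)[:, None]).T @ X         # sum_i (y_i - mu_i)^2 x_i x_i'
A_inv = np.linalg.inv(A)
se_default = np.sqrt(np.diag(A_inv))
se_robust = np.sqrt(np.diag(A_inv @ B @ A_inv))

print("coefficients:", b)
print("default s.e.:", se_default)
print("robust s.e.: ", se_robust)
```

With overdispersed data the default standard errors are too small, so t-statistics
based on them overstate the precision of the estimates.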

In addition to estimation it is strongly recommended that specification tests be used 
to assess the adequacy of the estimated model. For Poisson cross-section regression 
overdispersion tests are easy to implement. For any parametric model one can compare 
the actual and fitted frequency distribution of counts, although it is not always easy to 
understand the respect in which a model fails when the distribution of observed counts 
is highly dispersed. Formal statistical specification and goodness-of-fit tests based on 
actual and fitted frequencies are available. 
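
One simple regression-based overdispersion test, against an alternative of the NB2
form, regresses $[(y_i - \hat\mu_i)^2 - y_i]/\hat\mu_i$ on $\hat\mu_i$ without an
intercept and tests whether the coefficient is zero. The sketch below assumes NumPy;
the simulated data and the function names are hypothetical, and the usual OLS
t-statistic is used for simplicity.

```python
import numpy as np

def poisson_fit(X, y, iters=50):
    """Poisson regression by Newton-Raphson, starting at the constant-only fit."""
    b = np.zeros(X.shape[1])
    b[0] = np.log(y.mean())
    for _ in range(iters):
        mu = np.exp(X @ b)
        b = b + np.linalg.solve((X * mu[:, None]).T @ X, X.T @ (y - mu))
    return b

def overdispersion_test(y, mu_hat):
    """Coefficient and t-statistic from the no-intercept auxiliary regression."""
    dep = ((y - mu_hat) ** 2 - y) / mu_hat
    coef = (mu_hat @ dep) / (mu_hat @ mu_hat)
    resid = dep - coef * mu_hat
    se = np.sqrt((resid @ resid) / (len(y) - 1) / (mu_hat @ mu_hat))
    return coef, coef / se

rng = np.random.default_rng(6)
n = 5_000
x = rng.uniform(-1.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])
mu_true = np.exp(0.5 + 0.5 * x)                   # illustrative parameter values

samples = {
    "Poisson": rng.poisson(mu_true),
    "NB2": rng.poisson(mu_true * rng.gamma(shape=1.0, scale=1.0, size=n)),
}
results = {}
for label, y in samples.items():
    mu_hat = np.exp(X @ poisson_fit(X, y))
    results[label] = overdispersion_test(y, mu_hat)
    print(f"{label:8s} coefficient {results[label][0]:.3f}, t {results[label][1]:.2f}")
```

For the overdispersed sample the coefficient estimates the NB2 overdispersion
parameter, which equals one in this simulated design.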

In most practical situations one is likely to face the problem of model selection.
For likelihood-based models that are nonnested one can use selection criteria, such
as the Akaike information criterion, that are based on the fitted log-likelihood but
with a degrees-of-freedom penalty for models with many parameters.


20.9. Bibliographic Notes 


20.2 All the topics dealt with in this chapter are treated at greater length and depth 
by Cameron and Trivedi (1998), who also provide a comprehensive bibliography. 
Winkelmann (1997) also provides a treatment of the econometric literature on counts. 
The statistics literature generally analyzes counts in the context of GLM. The
standard reference is McCullagh and Nelder (1989). The econometrics literature
generally underemphasizes the contributions of the GLM literature. Fahrmeir and Tutz
(1994) provide a recent and more econometric exposition of GLMs. The material in
Section 20.2 is standard and appears in many places.


20.3 Deb and Trivedi (2002) give a detailed analysis of these RHIE data. 


20.4 Cameron and Trivedi (1986) provide an early presentation and application of the 
negative binomial. Hausman et al. (1984) applied the model and its variants to panel 
data. For the finite mixture approach of Section 20.4.3 see Deb and Trivedi (1997). 
Applications of the hurdle model in Section 20.4.5 include those by Mullahy (1986), 
who first proposed the model, Pohlmeier and Ulrich (1995), and Gurmu and Trivedi 
(1996). 


20.5 The quasi-MLE of Section 20.5.1 is presented in detail by Gouriéroux et al. (1984a,b) 
and by Cameron and Trivedi (1986). 


20.6 Regression models for the types of data discussed in Section 20.6 are in their infancy. 
The notable exception is that (static) panel data count models are well established, 
with the standard reference being Hausman et al. (1984). See also Brännäs and Jo- 
hansson (1996). Developing adequate regression models for multivariate count data 
and models with endogenous regressors is currently an active area; see Terza (1998), 
and Deb and Trivedi (2004). 


Exercises


Suppose that Y is Poisson distributed with mean $\mu$.


(a) Verify that the mean and the second, third, and fourth central moments are,
respectively, $\mu$, $\mu$, $\mu$, and $3\mu^2 + \mu$.

(b) Show that there is a linear relationship between $\Pr[Y = j]$ and
$\Pr[Y = j - 1]$, $j = 1, 2, \ldots$.

(c) Consider the Poisson MLE in the regression case with $\mu_i = \exp(x_i'\beta)$.
Possible estimates of the variance of the Poisson MLE include
$\hat{V}[\hat\beta] = [\sum_i \hat\mu_i x_i x_i']^{-1}$ and
$\hat{V}[\hat\beta] = [\sum_i (y_i - \hat\mu_i)^2 x_i x_i']^{-1}$. Show that they are
asymptotically equivalent (upon scaling by N) if the data density is correctly
specified.


Now consider overdispersion in the Poisson model.


(a) Suppose $Y|\mu \sim P[\mu]$, where $\mu = \exp(\beta_0 + \beta_1 X)$,
$\beta_0 = \gamma_0 + \varepsilon$, and $\varepsilon$ is an unobserved random variable
with $E[\varepsilon] = 0$, $V[\varepsilon] = \sigma^2 > 0$. Show that $V[Y] > E[Y]$.

(b) Consider the NB2 model with the variance function $\mu + \alpha\mu^2$ and the
probability mass function given in (20.12). Using graphs for four different values of
$\alpha \in [0, 3]$, describe the behavior of the probability mass for different
realized values of Y; in your answer concentrate on the behavior of the function near
the origin and in the right tail.

(c) For the NB2 density given in (20.12) in Section 20.4.1, show that as
$\alpha \to 0$ the density goes to the Poisson. [This could be tricky.]


Consider the Poisson regression model with conditional mean $\mu = \exp(x'\beta)$.
Treat the estimation problem as an unweighted nonlinear least-squares problem in
which $y = E[y|x] + \varepsilon$, where $E[y|x] = \exp(x'\beta)$ and
$\varepsilon \sim \mathrm{iid}[0, \sigma^2]$.

(a) Derive the nonlinear least-squares equations for $(\beta, \sigma^2)$. Compare the
least-squares and the maximum likelihood equations for $\beta$ and explain the
difference between them.

(b) Derive the weighted nonlinear least-squares equations for $\beta$. Explain your
choice of weights. [Weights are used to handle heteroskedasticity.]

(c) Compare the weighted nonlinear least-squares and the maximum likelihood
equations and explain the similarities, if any.


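Consider a finite mixture density $f(y|\theta) = \sum_{j=1}^{C} \pi_j f_j(y|\theta_j)$,
an additive mixture of C distinct latent classes, or subpopulations, with unknown
mixing proportions $\pi_1, \ldots, \pi_C$, where $\sum_{j=1}^{C} \pi_j = 1$,
$\pi_j > 0$. Here y is a count variable, and the jth component density
$f_j(y_i|\theta_j)$ for the ith observation is expressed as

\[
f_j(y_i|\theta_j) =
\frac{\Gamma(y_i + \psi_{j,i})}{\Gamma(\psi_{j,i})\, \Gamma(y_i + 1)}
\left( \frac{\psi_{j,i}}{\lambda_{j,i} + \psi_{j,i}} \right)^{\psi_{j,i}}
\left( \frac{\lambda_{j,i}}{\lambda_{j,i} + \psi_{j,i}} \right)^{y_i},
\]

where $\lambda_{j,i} = \exp(x_i'\beta_j)$, $\psi_{j,i} = \lambda_{j,i}^{k}/\alpha_j$,
$\alpha_j > 0$ and $\theta_j = (\beta_j, \alpha_j)$. Here k is either 0 or 1. This
model is the finite mixture negative binomial with C components and specializes to the
finite mixture Poisson as $\alpha_j \to 0$.

(a) Show that $E[y_i|x_i] = \bar\lambda_i = \sum_{j=1}^{C} \pi_j \lambda_{j,i}$ and
$V[y_i|x_i] = \sum_{j=1}^{C} \pi_j \lambda_{j,i}^2 [1 + \alpha_j \lambda_{j,i}^{-k}]
+ \bar\lambda_i - \bar\lambda_i^2$.

(b) Show that any mixture model based on the first moment alone is not
identified.

(c) Show that the C-component Poisson mixture based on the first two moments is
identified.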


(Adapted from Baltagi and Li, 1999) A simple test of overdispersion in a Poisson
model given in Section 20.2.4 tests the null hypothesis of zero coefficient in the
regression of $[(y_i - \hat\mu_i)^2 - y_i]/\hat\mu_i$ on $\hat\mu_i$. An alternative
test proposed in the literature (Baltagi and Li, 1999) involves the same test but is
based on the regression of $(y_i - \hat\mu_i)^2$ on $\hat\mu_i$. The latter can be
motivated by the idea of tests based on the Gauss-Newton regression (see Section
10.3.9). Analyze the differences between the tests and the implications of the
differences for the manner of implementing the second test.


For this problem use a 50% subsample of the data used in this chapter.


(a) Estimate Poisson and negative binomial regressions with MDU as the dependent
variable and the following explanatory variables: LC, IDP, LINC, FEMALE, EDUDEC,
XAGE, BLACK, HLTHG, HLTHF, and HLTHP. Carry out a likelihood ratio test of the null
hypothesis that the variables LC and IDP have no effect on MDU.

(b) Test for overdispersion in the Poisson regression using the variance
formulations (20.9) with $g(\mu) = \mu$ and (20.10) with $g(\mu) = \mu^2$ in this
chapter. Which version of the variance formulation gets more support from the data?
What do you conclude from this exercise?

(c) Estimate the negative binomial model (NB2). Compare the estimate of the
overdispersion parameter with that in part (b). Explain the similarities and
differences.

(d) Using the results from the negative binomial estimation, compare the estimated
marginal effect of a change in LC for an average individual in excellent health
(baseline) and an average individual in poor health (HLTHP = 1).

(e) For this Poisson specification estimate the “hurdle version” consisting of a
zero part (logit or probit) and a positive part (truncated-at-zero Poisson).
Compare these results with those from a regular Poisson model. Analyze the
similarities and differences between the implications of the two models. Based on
your analysis, which model do you regard as a better explanation of the data?


Models for Panel Data


Cross-section models have certain inherent limitations. They are predominantly equi- 
librium models that generally do not shed light on intertemporal dependence of events. 
They also cannot satisfactorily resolve fundamental issues about the sources of per- 
sistence in behavior. Such persistence may be behavioral, i.e. arising from true state 
dependence, or it may be spurious, being an artifact of the inability to control for het- 
erogeneous behavior in the population. Because panel data, also called longitudinal 
data, contain periodically repeated observations of the same subjects, they have a large 
potential for resolving issues that cross-section models cannot satisfactorily handle. 
Chapters 21 through 23 present methods for panel data. We progress systematically 
from linear models for continuous data in Chapter 21 to nonlinear panel data models 
for limited dependent variables in Chapter 23. Both fixed effects and random effects 
models are considered. A persistent theme through these three chapters is the impor- 
tance of using panel-robust methods of inference. 

Chapter 21, which reviews the key general results for linear panel data regression 
models, can be read easily by those with a good grasp of linear regression; it does not 
require the material covered in Parts 2 to 4. We recommend that even those who are 
interested in more advanced material should quickly peruse the contents of
this chapter first to gain familiarity with key concepts and definitions. 

Chapter 22 covers important extensions of Chapter 21, especially to dynamic panels 
which allow for Markovian dependence structure of current variables. The analysis is 
in the GMM framework that is currently favored by many practitioners in this area. 
The analysis here is at times intricate, involving many issues of detail. A strong grasp 
of GMM will be helpful in absorbing the main results of this chapter. 

The results of Chapters 21 and 22 do not extend to nonlinear panel models of Chap- 
ter 23 in a general and unified fashion. There are relatively fewer general results for 
limited dependent variable panel models. Despite this, in Chapter 23 we begin by 
presenting an analysis of some general issues and approaches. Later sections of this
chapter present panel data extensions of the counterpart cross-section models studied
in Part 4. These sections analyze four categories of models for binary, count, censored,
and duration data, respectively, and should be accessible to a suitably prepared reader
familiar with the parallel cross-section models.


CHAPTER 21


Linear Panel Models: Basics 


21.1. Introduction 


Panel data are repeated observations on the same cross section, typically of individu- 
als or firms in microeconomics applications, observed for several time periods. Other 
terms used for such data include longitudinal data and repeated measures. The focus 
is on data from a short panel, meaning a large cross section of individuals observed for 
a few time periods, rather than a long panel such as a small cross section of countries 
observed for many time periods. 

A major advantage of panel data is increased precision in estimation. This is the 
result of an increase in the number of observations owing to combining or pooling 
several time periods of data for each individual. However, for valid statistical infer- 
ence one needs to control for likely correlation of regression model errors over time 
for a given individual. In particular, the usual formula for OLS standard errors in a 
pooled OLS regression typically overstates the precision gains, leading to underesti- 
mated standard errors and t-statistics that can be greatly inflated. 
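
The extent to which default pooled OLS standard errors can overstate precision is easy
to see in a simulation. The following sketch assumes NumPy; the panel dimensions, the
data-generating process, and the parameter values are illustrative assumptions. Both
the regressor and the error contain an individual-specific component that is constant
over time, and the default standard errors are compared with cluster-robust ones that
allow arbitrary error correlation within an individual. No finite-sample correction is
applied to the cluster-robust estimate.

```python
import numpy as np

rng = np.random.default_rng(4)
N, T = 500, 5                                    # short panel: many individuals
ids = np.repeat(np.arange(N), T)

# regressor and error each have an individual-specific component
x = np.repeat(rng.normal(size=N), T) + rng.normal(size=N * T)
u = np.repeat(rng.normal(size=N), T) + rng.normal(size=N * T)
y = 1.0 + 0.5 * x + u
X = np.column_stack([np.ones(N * T), x])

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b

# default OLS variance estimate: s^2 (X'X)^{-1}
s2 = e @ e / (N * T - X.shape[1])
se_default = np.sqrt(np.diag(s2 * XtX_inv))

# cluster-robust variance: the middle term sums X_i' e_i e_i' X_i over individuals
scores = X * e[:, None]
S = np.zeros((N, X.shape[1]))
np.add.at(S, ids, scores)                        # per-individual sums of x_it * e_it
se_cluster = np.sqrt(np.diag(XtX_inv @ (S.T @ S) @ XtX_inv))

print("pooled OLS estimates:", b)
print("default s.e.:        ", se_default)
print("cluster-robust s.e.: ", se_cluster)
```

In this design the errors are independent of the regressor, so pooled OLS is
consistent; only the default standard errors are misleading.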

A second attraction of panel data is the possibility of consistent estimation of the 
fixed effects model, which allows for unobserved individual heterogeneity that may 
be correlated with regressors. Such unobserved heterogeneity leads to omitted vari- 
ables bias that could in principle be corrected by instrumental variables methods using 
only a single cross section, but in practice it can be difficult to obtain a valid instru- 
ment. Data from a short panel, with as few as two periods, offers an alternative way 
to proceed if the unobserved individual-specific effects are assumed to be additive and 
time-invariant. 

Most disciplines in applied statistics other than microeconometrics treat any unob- 
served individual heterogeneity as being distributed independently of the regressors. 
Then the effects are called random effects, though a better term is purely random ef- 
fects. Compared to fixed effects models this stronger assumption has the advantage 
of permitting consistent estimation of all parameters, including coefficients of time- 
invariant regressors. However, random effects and pooled estimators are inconsistent 
if the true model is one with fixed effects. Economists often view the assumptions for
the random effects model as being unsupported by the data. 

A third attraction of panel data is the possibility of learning more about the dynam- 
ics of individual behavior than is possible from a single cross section. Thus a cross 
section may yield a poverty rate of 20% but we need panel data to determine whether 
the same 20% are in poverty each year. As a related example, panel data may determine 
whether high serial correlation of individual earnings or unemployment spell length is 
due to an individual specific tendency to have high earnings or a long unemployment 
spell, or whether it is a consequence of having past high earnings or unemployment. 
This topic is deferred to Chapter 22. 

The linear panel data models and associated estimators are conceptually simple, 
aside from the fundamental issue of whether or not fixed effects are necessary. The 
considerable algebra used to derive the properties of panel data estimators should not 
distract one from an understanding of the basics: The statistical properties of panel 
data estimators vary with the assumed model and its treatment of unobserved effects. 
Furthermore, much of the algebra does not generalize to nonlinear panel models. 

The current chapter presents the basic estimators for various linear panel data mod- 
els. A lengthy introduction in Sections 21.2 and 21.3 provides, respectively, the com- 
monly used models and estimators and an application to the relationship between an- 
nual hours worked and wages. The important distinction between fixed and random 
effects models is studied in Section 21.4. Sections 21.5—21.7 present additional detail 
on estimation for, respectively, pooled models, individual-specific fixed effects mod- 
els, and individual-specific random effects models. Section 21.8 considers other basic 
aspects such as inference and prediction in linear panel data models. 


21.2. Overview of Models and Estimators 


Panel data provide information on individual behavior both across time and across 
individuals. 

Even for linear regression, standard panel data analysis uses a much wider range of 
models and estimators than is the case with cross-section data. Several standard models 
are presented in Section 21.2.1, followed by several estimators presented in Section 
21.2.2. Table 21.1 gives a summary that also indicates that several of the estimators 
are inconsistent if the dgp is the individual-specific fixed effects model. 

Obtaining correct standard errors of estimators is also more complicated than in 
the cross-section case. One needs to control for correlation over time in errors for a 
given individual, in addition to possible heteroskedasticity. This topic is covered in 
Section 21.2.3. 


21.2.1. Panel Data Models 


A very general linear model for panel data permits the intercept and slope coefficients 
to vary over both individual and time, with 


$$y_{it} = \alpha_{it} + \mathbf{x}_{it}'\boldsymbol{\beta}_{it} + u_{it}, \quad i = 1, \ldots, N, \quad t = 1, \ldots, T,$$
Table 21.1. Linear Panel Model: Common Estimators and Models^a

                                                    Assumed Model
                                    Pooled       Random Effects      Fixed Effects
Estimator of $\boldsymbol{\beta}$   (21.1)       (21.3) and (21.5)   (21.3) Only
Pooled OLS (21.1)                   Consistent   Consistent          Inconsistent
Between (21.7)                      Consistent   Consistent          Inconsistent
Within (or Fixed Effects) (21.8)    Consistent   Consistent          Consistent
First Differences (21.9)            Consistent   Consistent          Consistent
Random Effects (21.10)              Consistent   Consistent          Inconsistent

^a This table considers only consistency of estimators of $\boldsymbol{\beta}$. For correct
computation of standard errors see Section 21.2.3.


where $y_{it}$ is a scalar dependent variable, $\mathbf{x}_{it}$ is a $K \times 1$ vector of
independent variables, $u_{it}$ is a scalar disturbance term, $i$ indexes individual (or firm
or country) in a cross section, and $t$ indexes time.

This model is too general and is not estimable as there are more parameters to
estimate than observations. Further restrictions need to be placed on the extent to which
$\alpha_{it}$ and $\boldsymbol{\beta}_{it}$ vary with $i$ and $t$, and on the behavior of the
error $u_{it}$.


Pooled Model 


The most restrictive model is a pooled model that specifies constant coefficients, the 
usual assumption for cross-section analysis, so that 


$$y_{it} = \alpha + \mathbf{x}_{it}'\boldsymbol{\beta} + u_{it}. \tag{21.1}$$


If this model is correctly specified and regressors are uncorrelated with the error then 
it can be consistently estimated using pooled OLS. The error term is likely to be cor- 
related over time for a given individual, however, in which case the usual reported 
standard errors should not be used as they can be greatly downward biased. Further- 
more, the pooled OLS estimator is inconsistent if the fixed effects model, defined in 
the following, is appropriate. 


Individual and Time Dummies 


A simple variant of the model (21.1) permits intercepts to vary across individuals and
over time while slope parameters do not. Then
$y_{it} = \alpha_i + \gamma_t + \mathbf{x}_{it}'\boldsymbol{\beta} + u_{it}$, or

$$y_{it} = \sum_{j=1}^{N} \alpha_j d_{j,it} + \sum_{s=2}^{T} \gamma_s d_{s,it} + \mathbf{x}_{it}'\boldsymbol{\beta} + u_{it}, \tag{21.2}$$

where the $N$ individual dummies $d_{j,it}$ equal one if $i = j$ and equal zero otherwise,
the $(T - 1)$ time dummies $d_{s,it}$ equal one if $t = s$ and equal zero otherwise, and it is
assumed that $\mathbf{x}_{it}$ does not include an intercept. (If an intercept is included
then one of the $N$ individual dummies must be dropped.)

This model has $N + (T - 1) + \dim[\mathbf{x}]$ parameters that can be consistently
estimated if both $N \to \infty$ and $T \to \infty$. We focus on short panels where
$N \to \infty$ but $T$ does not. Then the $\gamma_t$ can be consistently estimated, so the
$(T - 1)$ time dummies are simply incorporated into the regressors $\mathbf{x}_{it}$. The
challenge then lies in estimating the parameters $\boldsymbol{\beta}$ controlling for the $N$
individual intercepts $\alpha_i$. One possibility is to instead have dummies for groups of
observations, such as grouping by region, in which case the clustering methods of Chapter 24
are relevant. Here instead we specify a full set of $N$ individual intercepts, which causes
problems as $N \to \infty$.


Fixed Effects and Random Effects Models 


The individual-specific effects model allows each cross-sectional unit to have a dif- 
ferent intercept term though all slopes are the same, so that 


$$y_{it} = \alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta} + \varepsilon_{it}, \tag{21.3}$$


where $\varepsilon_{it}$ is iid over $i$ and $t$. This is a more parsimonious way to express
model (21.2), with any time dummies included in the regressors $\mathbf{x}_{it}$. The
$\alpha_i$ are random variables that capture unobserved heterogeneity, already studied in
Sections 18.2–18.5 and 20.4.

Throughout this chapter we make the assumption of strong exogeneity or strict 
exogeneity 


$$\mathrm{E}[\varepsilon_{it} \mid \alpha_i, \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}] = 0, \quad t = 1, \ldots, T, \tag{21.4}$$


so that the error term is assumed to have mean zero conditional on past, current, and 
future values of the regressors. Chamberlain (1980) gives a detailed discussion of ex- 
ogeneity assumptions and tests for exogeneity for panel data. Strong exogeneity rules 
out models with lagged dependent variables or with endogenous variables as regres- 
sors; these models are deferred to Chapter 22. 

One variant of the model (21.3) treats $\alpha_i$ as an unobserved random variable that is
potentially correlated with the observed regressors $\mathbf{x}_{it}$. This variant is called
the fixed effects (FE) model as early treatments modeled these effects as parameters
$\alpha_1, \ldots, \alpha_N$ to be estimated. If fixed effects are present and correlated with
$\mathbf{x}_{it}$ then many estimators such as pooled OLS are inconsistent. Instead,
alternative estimation methods that eliminate the $\alpha_i$ are needed to ensure consistent
estimation of $\boldsymbol{\beta}$ in a short panel.

The other variant of the model (21.3) assumes that the unobservable individual effects
$\alpha_i$ are random variables that are distributed independently of the regressors. This
model is called the random effects (RE) model, which usually makes the additional
assumptions that

$$\alpha_i \sim [\alpha, \sigma_\alpha^2], \qquad \varepsilon_{it} \sim [0, \sigma_\varepsilon^2], \tag{21.5}$$

so that both the random effects and the error term in (21.3) are assumed to be iid. Note
that no specific distributions have been specified in (21.5). A more precise term for this
model is the one-way individual-specific random effects model, or more simply the
random intercept model, to distinguish the model from more general random effects
models such as the mixed linear models presented in Section 22.8. Yet another name
is the random components model.

The term fixed effect is potentially misleading and the term random effect is more
precisely a purely random effect. To avoid such confusion, M-J. Lee (2002) calls a
fixed effect a "related effect" and a random effect an "unrelated effect." We use the
traditional notation and terminology, but it should be clear that $\alpha_i$ is a random
variable in both fixed and random effects models.


Equicorrelated Model 


The RE model can be viewed as a specialization of the pooled model, as the $\alpha_i$ can
be subsumed into the error term. Then (21.3) can be viewed as regression of $y_{it}$ on
$\mathbf{x}_{it}$ with composite error term $u_{it} = \alpha_i + \varepsilon_{it}$, and (21.5)
implies that

$$\mathrm{Cov}[(\alpha_i + \varepsilon_{it}), (\alpha_i + \varepsilon_{is})] = \begin{cases} \sigma_\alpha^2, & t \neq s, \\ \sigma_\alpha^2 + \sigma_\varepsilon^2, & t = s. \end{cases} \tag{21.6}$$

The RE model therefore imposes the constraint that the composite error $u_{it}$ is
equicorrelated, since
$\mathrm{Cor}[u_{it}, u_{is}] = \sigma_\alpha^2 / [\sigma_\alpha^2 + \sigma_\varepsilon^2]$ for
$t \neq s$ does not vary with the time difference $t - s$. Clearly, pooled OLS will be
consistent but inefficient under the RE model. The random effects model is also called the
equicorrelated model or exchangeable errors model.
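
The step from (21.5) to (21.6) can be written out as follows. This is only a sketch; it assumes, beyond what (21.5) states explicitly, that $\alpha_i$ and $\varepsilon_{it}$ are uncorrelated with each other, and it uses the indicator $\mathbf{1}(t = s)$ as shorthand.

```latex
\begin{align*}
\mathrm{Cov}[u_{it}, u_{is}]
  &= \mathrm{Cov}[\alpha_i + \varepsilon_{it},\ \alpha_i + \varepsilon_{is}] \\
  &= \mathrm{V}[\alpha_i] + \mathrm{Cov}[\alpha_i, \varepsilon_{is}]
     + \mathrm{Cov}[\varepsilon_{it}, \alpha_i]
     + \mathrm{Cov}[\varepsilon_{it}, \varepsilon_{is}] \\
  &= \sigma_\alpha^2 + 0 + 0 + \mathbf{1}(t = s)\,\sigma_\varepsilon^2 , \\[4pt]
\mathrm{Cor}[u_{it}, u_{is}]
  &= \frac{\mathrm{Cov}[u_{it}, u_{is}]}{\sqrt{\mathrm{V}[u_{it}]\,\mathrm{V}[u_{is}]}}
   = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_\varepsilon^2},
  \qquad t \neq s .
\end{align*}
```

The correlation depends only on the two variances, not on $t - s$, which is what "equicorrelated" means here.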


Fixed versus Random Effects Models 


The fundamental distinction is between models with and without fixed effects. The 
modern econometrics literature emphasizes fixed effects, but we also provide details 
for the random effects model. 

Some authors, including Chamberlain (1980, 1984) and Wooldridge (2002), use the 
notation 


$$y_{it} = c_i + \mathbf{x}_{it}'\boldsymbol{\beta} + \varepsilon_{it}$$


in (21.3) to make it very clear that the individual effect is a random variable in both 
fixed and random effects models. Both models assume that 


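$$\mathrm{E}[y_{it} \mid c_i, \mathbf{x}_{it}] = c_i + \mathbf{x}_{it}'\boldsymbol{\beta}.$$


The individual-specific effect $c_i$ is unknown and in short panels cannot be consistently
estimated, so we cannot estimate $\mathrm{E}[y_{it} \mid c_i, \mathbf{x}_{it}]$. Instead, we
can eliminate $c_i$ by taking the expectation with respect to $c_i$, leading to


$$\mathrm{E}[y_{it} \mid \mathbf{x}_{it}] = \mathrm{E}[c_i \mid \mathbf{x}_{it}] + \mathbf{x}_{it}'\boldsymbol{\beta}.$$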


For the RE model it is assumed that $\mathrm{E}[c_i \mid \mathbf{x}_{it}] = \alpha$, so
$\mathrm{E}[y_{it} \mid \mathbf{x}_{it}] = \alpha + \mathbf{x}_{it}'\boldsymbol{\beta}$ and hence
it is possible to identify $\mathrm{E}[y_{it} \mid \mathbf{x}_{it}]$. In the FE model, however,
$\mathrm{E}[c_i \mid \mathbf{x}_{it}]$ varies with $\mathbf{x}_{it}$ and it is not known how it
varies, so we cannot identify $\mathrm{E}[y_{it} \mid \mathbf{x}_{it}]$. It is nonetheless
possible to consistently estimate $\boldsymbol{\beta}$ in the FE model with short panels (as
will be discussed in the following). Thus it is possible in the FE model to identify the
marginal effect


$$\boldsymbol{\beta} = \partial \mathrm{E}[y_{it} \mid c_i, \mathbf{x}_{it}] / \partial \mathbf{x}_{it},$$


even though the conditional mean is not identified. For example, it is possible to identify
the effect on earnings of an additional year of schooling, controlling for individual
effects, even though the individual effects and the conditional mean are not identified.
In short panels the FE model permits only identification of the marginal effect
$\partial \mathrm{E}[y_{it} \mid c_i, \mathbf{x}_{it}] / \partial \mathbf{x}_{it}$, and even then
only for time-varying regressors, so the marginal effect of race or gender, for example, is
not identified. The RE model permits identification of all components of $\boldsymbol{\beta}$
and of $\mathrm{E}[y_{it} \mid \mathbf{x}_{it}]$, but the key RE assumption that
$\mathrm{E}[c_i \mid \mathbf{x}_{it}]$ is constant is viewed as untenable in many
microeconometrics applications.


21.2.2. Panel Data Estimators 


We now introduce several commonly used panel data estimators of $\boldsymbol{\beta}$, with
further detail provided in Sections 21.5–21.7. The estimators differ in the extent to which
cross-section and time-series variation in the data are used, and their properties vary
according to whether or not the fixed effects model is the appropriate model.

A regressor $\mathbf{x}_{it}$ may be either time-invariant, with
$\mathbf{x}_{it} = \mathbf{x}_i$ for $t = 1, \ldots, T$, or time-varying. For some estimators,
notably the within and first-differences estimators defined in the following, only the
coefficients of time-varying regressors are identified.


Pooled OLS 


The pooled OLS estimator is obtained by stacking the data over $i$ and $t$ into one long
regression with $NT$ observations, and estimating by OLS


$$y_{it} = \alpha + \mathbf{x}_{it}'\boldsymbol{\beta} + u_{it}, \quad i = 1, \ldots, N, \quad t = 1, \ldots, T.$$


If $\mathrm{Cov}[u_{it}, \mathbf{x}_{it}] = 0$ then either $N \to \infty$ or $T \to \infty$ is
sufficient for consistency.

The pooled OLS estimator is clearly consistent if the pooled model (21.1) is ap- 
propriate and regressors are uncorrelated with the error term. The usual OLS variance 
matrix based on iid errors, however, is not appropriate here as the errors for a given 
individual i are almost certainly positively correlated over t. The NT correlated obser- 
vations have less information than NT independent observations. 

To understand this correlation, note that for a given individual we expect considerable
correlation in $y$ over time, so that $\mathrm{Cor}[y_{it}, y_{is}]$ is high. Even after
inclusion of regressors $\mathrm{Cor}[u_{it}, u_{is}]$ may remain nonzero, and it often can
still be quite high. For example, if a model overpredicts individual earnings in one year it
may also overpredict earnings for the same individual in other years. The RE model
accommodates this correlation, with
$\mathrm{Cor}[u_{it}, u_{is}] = \sigma_\alpha^2 / [\sigma_\alpha^2 + \sigma_\varepsilon^2]$ for
$t \neq s$ from (21.6).

The usual OLS output treats each of the T years as independent pieces of informa- 
tion, but the information content is less than this given the positive error correlation. 
This leads to overstatement of estimator precision that can be very large, as illustrated 
in Section 21.3.2 and formally demonstrated in Section 21.5.4. One therefore needs to
use panel-corrected standard errors (see Section 21.2.3) whenever OLS is applied in 
a panel setting. Many corrections are possible, depending on the correlation and het- 
eroskedasticity structure assumed for the errors and whether the panel is short or long 
(see Section 21.5). 

The pooled OLS estimator is inconsistent if the true model is the fixed effects
model. To see this, rewrite the model (21.3) as


$$y_{it} = \alpha + \mathbf{x}_{it}'\boldsymbol{\beta} + (\alpha_i - \alpha + \varepsilon_{it}).$$


Then pooled OLS regression of $y_{it}$ on $\mathbf{x}_{it}$ and an intercept leads to an
inconsistent estimator of $\boldsymbol{\beta}$ if the individual effect $\alpha_i$ is
correlated with the regressors $\mathbf{x}_{it}$, since such correlation implies that the
combined error term $(\alpha_i - \alpha + \varepsilon_{it})$ is correlated with the regressors.

In summary, pooled OLS is appropriate if the constant-coefficients or random ef- 
fects models are appropriate, but panel-corrected standard errors and t-statistics must 
be used for statistical inference. Pooled OLS is inconsistent if the fixed effects model 
is appropriate. 
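
The inconsistency of pooled OLS under fixed effects can be seen in a toy calculation. The sketch below uses a small, noise-free, entirely hypothetical panel (the data, `BETA`, and the helper `ols_slope` are illustrative assumptions, not part of the chapter's application) in which the individual effect rises with the regressor, so that the composite error is correlated with $x_{it}$.

```python
def ols_slope(x, y):
    """Slope from OLS regression of y on an intercept and one regressor x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical panel: N = 3 individuals, T = 3 periods, true slope BETA = 0.5.
BETA = 0.5
alpha = [0.0, 3.0, 6.0]  # individual effects alpha_i
# x_it = 2*i + t, so individuals with larger alpha_i also have larger x_it.
x = [[2.0 * i + t for t in range(3)] for i in range(3)]
y = [[alpha[i] + BETA * x[i][t] for t in range(3)] for i in range(3)]

# Pooled OLS: stack the NT observations into one long regression.
x_long = [v for row in x for v in row]
y_long = [v for row in y for v in row]
pooled_beta = ols_slope(x_long, y_long)

print("true beta:", BETA)
print("pooled OLS slope:", pooled_beta)
```

In this constructed example the pooled slope is about 1.7 rather than 0.5, because it absorbs the association between $\alpha_i$ and $x_{it}$; no amount of additional data of the same kind would remove that gap.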


Between Estimator 


The pooled OLS estimator uses variation over both time and cross-sectional units to
estimate $\boldsymbol{\beta}$.

The between estimator in short panels instead uses just the cross-sectional variation.
Begin with the individual-specific effects model (21.3). Averaging over all years yields
$\bar{y}_i = \alpha_i + \bar{\mathbf{x}}_i'\boldsymbol{\beta} + \bar{\varepsilon}_i$, which can
be rewritten as the between model


$$\bar{y}_i = \alpha + \bar{\mathbf{x}}_i'\boldsymbol{\beta} + (\alpha_i - \alpha + \bar{\varepsilon}_i), \quad i = 1, \ldots, N, \tag{21.7}$$


where $\bar{y}_i = T^{-1} \sum_t y_{it}$, $\bar{\varepsilon}_i = T^{-1} \sum_t \varepsilon_{it}$,
and $\bar{\mathbf{x}}_i = T^{-1} \sum_t \mathbf{x}_{it}$.

The between estimator is the OLS estimator from regression of $\bar{y}_i$ on an intercept
and $\bar{\mathbf{x}}_i$. It uses variation between different individuals and is the analogue
of cross-section regression, which is the special case $T = 1$.

The between estimator is consistent if the regressors $\bar{\mathbf{x}}_i$ are independent
of the composite error $(\alpha_i - \alpha + \bar{\varepsilon}_i)$ in (21.7). This will be the
case for the constant-coefficients model and the random effects model. In contrast, for the
fixed effects model the between estimator is inconsistent as $\alpha_i$ is then assumed to be
correlated with $\mathbf{x}_{it}$, and hence $\bar{\mathbf{x}}_i$.
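
A minimal sketch of the between estimator for a single regressor follows. The two hypothetical, noise-free data sets differ only in the individual effects: in the first the $\alpha_i$ are uncorrelated with $\bar{x}_i$ in the sample, in the second they rise with $\bar{x}_i$. All names and numbers are illustrative assumptions.

```python
def ols_slope(x, y):
    """Slope from OLS regression of y on an intercept and one regressor x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def between_slope(x, y):
    """Between estimator: OLS of individual means ybar_i on xbar_i, as in (21.7)."""
    x_means = [sum(row) / len(row) for row in x]
    y_means = [sum(row) / len(row) for row in y]
    return ols_slope(x_means, y_means)


BETA = 0.5
x = [[0.0, 2.0], [1.0, 3.0], [2.0, 4.0]]  # individual means 1, 2, 3


def make_y(alpha):
    """Noise-free outcomes y_it = alpha_i + BETA * x_it."""
    return [[a + BETA * v for v in row] for a, row in zip(alpha, x)]


# Effects with zero sample covariance with xbar_i (random-effects-like case).
y_unrelated = make_y([1.0, -2.0, 1.0])
# Effects that rise with xbar_i (fixed-effects case).
y_related = make_y([0.0, 3.0, 6.0])

print("between slope, unrelated effects:", between_slope(x, y_unrelated))
print("between slope, related effects:  ", between_slope(x, y_related))
```

With unrelated effects the between slope recovers 0.5; with related effects it is 3.5, since the cross-sectional variation cannot separate $\alpha_i$ from $\bar{x}_i$.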


Within Estimator or Fixed Effects Estimator 


The within estimator is an estimator that, unlike the pooled OLS or between estimators, 
exploits the special features of panel data. In a short panel it measures the association 
between individual-specific deviations of regressors from their time-averaged values 
and individual-specific deviations of the dependent variable from its time-averaged 
value. This is done using the variation in the data over time. 
Specifically, begin with the individual-specific effects model (21.3), which nests
(21.1) as the special case $\alpha_i = \alpha$. Then taking the average over time yields
$\bar{y}_i = \alpha_i + \bar{\mathbf{x}}_i'\boldsymbol{\beta} + \bar{\varepsilon}_i$. Subtracting
this from $y_{it}$ in (21.3) yields the within model


$$y_{it} - \bar{y}_i = (\mathbf{x}_{it} - \bar{\mathbf{x}}_i)'\boldsymbol{\beta} + (\varepsilon_{it} - \bar{\varepsilon}_i), \quad i = 1, \ldots, N, \quad t = 1, \ldots, T, \tag{21.8}$$


as the $\alpha_i$ terms cancel.

The within estimator is the OLS estimator in (21.8). A special feature of this estimator
is that it yields consistent estimates of $\boldsymbol{\beta}$ in the fixed effects model,
whereas the pooled OLS and between estimators do not.

From Section 21.6 the within estimator has several interpretations. It is called the
fixed effects estimator as it is the efficient estimator of $\boldsymbol{\beta}$ in the model
(21.3) if $\alpha_i$ are fixed effects and the error $\varepsilon_{it}$ is iid. This chapter
focuses on a literature that treats fixed effects as nuisance parameters that can be ignored
since interest lies solely in estimation of $\boldsymbol{\beta}$. If instead the fixed effects
are of interest they can also be estimated. In short panels these estimates of the individual
$\alpha_i$ are inconsistent, though their distribution or their variation with a key variable
may be informative. If $N$ is not too large an alternative and simpler way to compute the
within estimator is by least-squares dummy variable estimation. This directly estimates
(21.2) by OLS regression of $y_{it}$ on $\mathbf{x}_{it}$ and the $N$ individual dummy
variables and yields the within estimator for $\boldsymbol{\beta}$, along with estimates of the
$N$ fixed effects (see Section 21.6.4). Yet another interpretation of the within estimator is
the covariance estimator. Finally, taking deviations from individual-specific averages is
equivalent to taking residuals from auxiliary regression of $y_{it}$ and $\mathbf{x}_{it}$ on
individual dummies and then working with the residuals.
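
The within transformation in (21.8) can be sketched for a single regressor as below. The panel is hypothetical and noise-free, with individual effects that are strongly related to the regressor; `within_slope` is an illustrative helper, not a library routine.

```python
def within_slope(x, y):
    """Within estimator for one regressor: OLS on deviations from individual means."""
    num = 0.0
    den = 0.0
    for x_i, y_i in zip(x, y):
        x_bar = sum(x_i) / len(x_i)
        y_bar = sum(y_i) / len(y_i)
        for x_it, y_it in zip(x_i, y_i):
            # Demeaning removes alpha_i, so no intercept is needed in (21.8).
            num += (x_it - x_bar) * (y_it - y_bar)
            den += (x_it - x_bar) ** 2
    return num / den


# Hypothetical panel: N = 3, T = 3, true slope 0.5, effects related to x_it.
BETA = 0.5
alpha = [0.0, 3.0, 6.0]
x = [[2.0 * i + t ** 2 for t in range(3)] for i in range(3)]
y = [[alpha[i] + BETA * x[i][t] for t in range(3)] for i in range(3)]

fe_beta = within_slope(x, y)
print("within (fixed effects) slope:", fe_beta)
```

Because only within-individual variation is used, the estimate equals the true slope of 0.5 here (up to floating-point rounding) even though the effects are correlated with the regressor.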

A major limitation of within estimation is that the coefficients of time-invariant
regressors are not identified in the within model, since if $\mathbf{x}_{it} = \mathbf{x}_i$
then $\bar{\mathbf{x}}_i = \mathbf{x}_i$ so $(\mathbf{x}_{it} - \bar{\mathbf{x}}_i) = \mathbf{0}$.
Many studies seek to estimate the effect of time-invariant regressors.
For example, in panel wage regressions we may be interested in the effect of gender or 
race. For this reason many practitioners prefer not to use the within estimator. Pooled 
OLS or random effects estimators permit estimation of coefficients of time-invariant 
regressors, but these estimators are inconsistent if the fixed effects model is the correct 
model. 


First-Differences Estimator 


The first-differences estimator also exploits the special features of panel data. In a short 
panel it measures the association between individual-specific one-period changes in 
regressors and individual-specific one-period changes in the dependent variable. 

Specifically, begin with the individual-specific effects model (21.3). Then lagging
one period yields
$y_{i,t-1} = \alpha_i + \mathbf{x}_{i,t-1}'\boldsymbol{\beta} + \varepsilon_{i,t-1}$. Subtracting
this from $y_{it}$ in (21.3) yields the first-differences model


$$y_{it} - y_{i,t-1} = (\mathbf{x}_{it} - \mathbf{x}_{i,t-1})'\boldsymbol{\beta} + (\varepsilon_{it} - \varepsilon_{i,t-1}), \quad i = 1, \ldots, N, \quad t = 2, \ldots, T, \tag{21.9}$$


as the $\alpha_i$ terms cancel.

The first-differences estimator is the OLS estimator in (21.9). Like the within estimator,
this estimator yields consistent estimates of $\boldsymbol{\beta}$ in the fixed effects model,
though the coefficients of time-invariant regressors are not identified. The
first-differences estimator is less efficient than the within estimator for $T > 2$ if
$\varepsilon_{it}$ is iid.
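
As a rough illustration of (21.9), the sketch below computes the first-differences slope for a single regressor and checks it against the within slope. It relies on the standard result that the two estimators coincide when $T = 2$; the data and helper names are hypothetical.

```python
def first_diff_slope(x, y):
    """First-differences estimator for one regressor: OLS of dy on dx, no intercept."""
    num = 0.0
    den = 0.0
    for x_i, y_i in zip(x, y):
        for t in range(1, len(x_i)):
            dx = x_i[t] - x_i[t - 1]
            dy = y_i[t] - y_i[t - 1]
            num += dx * dy
            den += dx * dx
    return num / den


def within_slope(x, y):
    """Within estimator for one regressor, included only for comparison."""
    num = 0.0
    den = 0.0
    for x_i, y_i in zip(x, y):
        x_bar = sum(x_i) / len(x_i)
        y_bar = sum(y_i) / len(y_i)
        for x_it, y_it in zip(x_i, y_i):
            num += (x_it - x_bar) * (y_it - y_bar)
            den += (x_it - x_bar) ** 2
    return num / den


# Noise-free hypothetical panel with T = 3: differencing removes alpha_i exactly.
alpha = [0.0, 3.0, 6.0]
x3 = [[2.0 * i + t ** 2 for t in range(3)] for i in range(3)]
y3 = [[alpha[i] + 0.5 * x3[i][t] for t in range(3)] for i in range(3)]
print("first-differences slope, T = 3:", first_diff_slope(x3, y3))

# Noisy hypothetical panel with T = 2: first differences and within coincide.
x2 = [[1.0, 3.0], [2.0, 5.0], [4.0, 4.5]]
y2 = [[2.0, 2.5], [0.0, 3.0], [7.0, 6.0]]
print("first-differences slope, T = 2:", first_diff_slope(x2, y2))
print("within slope, T = 2:           ", within_slope(x2, y2))
```

For $T > 2$ the two estimators generally differ in finite samples, which is where the efficiency comparison in the text becomes relevant.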


Random Effects Estimator 


The random effects estimator is an estimator that also exploits the special features of 
panel data. 

Begin with the individual-specific effects model (21.3), but assume a random effects
model where $\alpha_i$ and $\varepsilon_{it}$ are iid as in (21.5). Pooled OLS is consistent
but pooled GLS will be more efficient. The feasible GLS estimator (see Section 4.5.1) of the
RE model, called the random effects estimator, can be calculated from OLS estimation
of the transformed model


$$y_{it} - \hat{\lambda}\bar{y}_i = (1 - \hat{\lambda})\mu + (\mathbf{x}_{it} - \hat{\lambda}\bar{\mathbf{x}}_i)'\boldsymbol{\beta} + v_{it}, \tag{21.10}$$


where $v_{it} = (1 - \hat{\lambda})\alpha_i + (\varepsilon_{it} - \hat{\lambda}\bar{\varepsilon}_i)$
is asymptotically iid, and $\hat{\lambda}$ is consistent for


$$\lambda = 1 - \frac{\sigma_\varepsilon}{\sqrt{\sigma_\varepsilon^2 + T\sigma_\alpha^2}}. \tag{21.11}$$

Section 21.7 provides a derivation of (21.10) and ways to estimate $\sigma_\alpha^2$ and
$\sigma_\varepsilon^2$ and hence to estimate $\lambda$. Note that $\hat{\lambda} = 0$
corresponds to pooled OLS, $\hat{\lambda} = 1$ corresponds to within estimation, and
$\hat{\lambda} \to 1$ as $T \to \infty$. This is a two-step estimator of $\boldsymbol{\beta}$.

The RE estimator is fully efficient under the RE model, though the efficiency gain
compared to pooled OLS need not be great. It is inconsistent, however, if the fixed
effects model is the correct model.
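
The role of $\lambda$ in (21.10) and (21.11) can be explored numerically. The sketch below takes the two standard deviations as given (in practice they must be estimated, as described in Section 21.7); the function names and the input values are illustrative assumptions.

```python
import math


def re_lambda(sigma_eps, sigma_alpha, T):
    """Lambda from (21.11): 1 - sigma_eps / sqrt(sigma_eps^2 + T * sigma_alpha^2)."""
    return 1.0 - sigma_eps / math.sqrt(sigma_eps ** 2 + T * sigma_alpha ** 2)


def quasi_demean(series, lam):
    """Transform one individual's series as in (21.10): z_it - lam * zbar_i."""
    mean = sum(series) / len(series)
    return [z - lam * mean for z in series]


# No individual effect variance: lambda = 0, so the data are left untransformed.
print(re_lambda(1.0, 0.0, 5))
# Equal variances with T = 3: lambda = 1 - 1/sqrt(4) = 0.5.
print(re_lambda(1.0, 1.0, 3))
# Lambda approaches 1 as T grows, so the transformation approaches full demeaning.
print(re_lambda(1.0, 1.0, 10 ** 8))

print(quasi_demean([1.0, 2.0, 3.0], 0.0))  # pooled OLS case
print(quasi_demean([1.0, 2.0, 3.0], 1.0))  # within case
```

Intermediate values of $\lambda$ subtract only a fraction of the individual mean, which is why the RE estimate typically lies between the pooled OLS and within estimates.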


21.2.3. Panel-Robust Statistical Inference 


The various panel models include error terms denoted $u_{it}$, $\varepsilon_{it}$, and
$\alpha_i$. In many microeconometrics applications it is reasonable to assume independence
over $i$. However, the errors are potentially (1) serially correlated (i.e., correlated over
$t$ for given $i$) and/or (2) heteroskedastic. Valid statistical inference requires
controlling for both of these factors.

The White heteroskedastic consistent estimator of Section 4.4.5 is easily extended
to short panels since for the $i$th observation the error variance matrix is of finite
dimension $T \times T$ while $N \to \infty$. Thus panel-robust standard errors can be obtained
without assuming specific functional forms for either within-individual error correlation or
heteroskedasticity. More efficient estimators using GMM are deferred to Section 22.2.3.

It is crucial to note that frequently the panel commands in many computer packages
calculate default standard errors assuming iid model errors, leading to erroneous
inference. In particular, for pooled OLS regression of $y_{it}$ on $\mathbf{x}_{it}$ without
any control for individual effects it is very likely that $\mathrm{Cov}[u_{it}, u_{is}] > 0$
for $t \neq s$. Ignoring this serial correlation can lead to greatly underestimated standard
errors and overestimated t-statistics, as demonstrated in the Section 21.3 data example and
shown algebraically in Section 21.5.4. Once fixed or random individual-specific effects are
included the serial correlation in errors can be greatly reduced, but it may not be
completely eliminated. Additionally, one may need to control for potential
heteroskedasticity as is routinely done for cross-section data.


Panel-Robust Sandwich Standard Errors 


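The panel estimators of Section 21.2.2 can be obtained by OLS estimation of
$\boldsymbol{\theta}$ in the pooled regression


$$\tilde{y}_{it} = \tilde{\mathbf{w}}_{it}'\boldsymbol{\theta} + \tilde{u}_{it}, \tag{21.12}$$


where different panel estimators correspond to different transformations $\tilde{y}_{it}$,
$\tilde{\mathbf{w}}_{it}$, and $\tilde{u}_{it}$ of $y_{it}$,
$\mathbf{w}_{it} = [1 \;\; \mathbf{x}_{it}']'$, and $u_{it}$. The key is that $\tilde{y}_{it}$
is a known function of only $y_{i1}, \ldots, y_{iT}$, and similarly for
$\tilde{\mathbf{w}}_{it}$ and $\tilde{u}_{it}$.

In the simplest case of pooled OLS, no transformation is necessary and
$\boldsymbol{\theta} = [\alpha \;\; \boldsymbol{\beta}']'$. For the within estimator
$\tilde{y}_{it} = y_{it} - \bar{y}_i$ and
$\tilde{\mathbf{w}}_{it} = (\mathbf{x}_{it} - \bar{\mathbf{x}}_i)$, where only time-varying
regressors appear, and $\boldsymbol{\theta}$ equals the coefficients of the time-varying
regressors. For first-differences estimation $\tilde{y}_{it} = y_{it} - y_{i,t-1}$ and
$\tilde{\mathbf{w}}_{it} = (\mathbf{x}_{it} - \mathbf{x}_{i,t-1})$, and again only coefficients
of time-varying regressors are identified. For random effects
$\tilde{y}_{it} = y_{it} - \hat{\lambda}\bar{y}_i$,
$\tilde{\mathbf{w}}_{it} = (\mathbf{w}_{it} - \hat{\lambda}\bar{\mathbf{w}}_i)$, and
$\boldsymbol{\theta} = [\alpha \;\; \boldsymbol{\beta}']'$. Such transformations can induce
serial correlation even if underlying errors are uncorrelated.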

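It is convenient to stack observations over time periods for a given individual, leading
to


$$\tilde{\mathbf{y}}_i = \tilde{\mathbf{W}}_i\boldsymbol{\theta} + \tilde{\mathbf{u}}_i,$$

where $\tilde{\mathbf{y}}_i$ is a $T \times 1$ vector in the preceding examples, except for
the first-differences model where it is $(T - 1) \times 1$, and $\tilde{\mathbf{W}}_i$ is a
$T \times q$ matrix or, for the first-differences model, a $(T - 1) \times q$ matrix. Further
stacking over the $N$ individuals yields


$$\tilde{\mathbf{y}} = \tilde{\mathbf{W}}\boldsymbol{\theta} + \tilde{\mathbf{u}}.$$

Three representations of the OLS estimator are therefore


$$\hat{\boldsymbol{\theta}}_{OLS} = [\tilde{\mathbf{W}}'\tilde{\mathbf{W}}]^{-1}\tilde{\mathbf{W}}'\tilde{\mathbf{y}} = \Big[\sum_{i=1}^{N} \tilde{\mathbf{W}}_i'\tilde{\mathbf{W}}_i\Big]^{-1} \sum_{i=1}^{N} \tilde{\mathbf{W}}_i'\tilde{\mathbf{y}}_i = \Big[\sum_{i=1}^{N}\sum_{t=1}^{T} \tilde{\mathbf{w}}_{it}\tilde{\mathbf{w}}_{it}'\Big]^{-1} \sum_{i=1}^{N}\sum_{t=1}^{T} \tilde{\mathbf{w}}_{it}\tilde{y}_{it},$$


where in the third expression the sum is from $t = 2$ to $T$ in the case of the
first-differences estimator. The most convenient representation to use varies with the
context.

To consider consistency, note that if the model is correctly specified then the usual
algebra yields
$\hat{\boldsymbol{\theta}}_{OLS} = \boldsymbol{\theta} + [\tilde{\mathbf{W}}'\tilde{\mathbf{W}}]^{-1}\tilde{\mathbf{W}}'\tilde{\mathbf{u}}$
or


$$\hat{\boldsymbol{\theta}}_{OLS} = \boldsymbol{\theta} + \Big[\sum_{i=1}^{N} \tilde{\mathbf{W}}_i'\tilde{\mathbf{W}}_i\Big]^{-1} \sum_{i=1}^{N} \tilde{\mathbf{W}}_i'\tilde{\mathbf{u}}_i.$$
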
Given independence over $i$ the essential condition for consistency is
$\mathrm{E}[\tilde{\mathbf{W}}_i'\tilde{\mathbf{u}}_i] = \mathbf{0}$. This generally requires a
stronger assumption than $\mathrm{E}[\tilde{u}_{it} \mid \tilde{\mathbf{w}}_{it}] = 0$. A
sufficient assumption is that of strong exogeneity given in (21.4). See Chapter 22 for
estimation under assumptions weaker than strong exogeneity that permit, for example, lagged
dependent variables as regressors.

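The asymptotic variance of $\hat{\boldsymbol{\theta}}_{OLS}$ is then


$$\mathrm{V}[\hat{\boldsymbol{\theta}}_{OLS}] = \Big[\sum_{i=1}^{N} \tilde{\mathbf{W}}_i'\tilde{\mathbf{W}}_i\Big]^{-1} \sum_{i=1}^{N} \tilde{\mathbf{W}}_i'\,\mathrm{E}[\tilde{\mathbf{u}}_i\tilde{\mathbf{u}}_i' \mid \tilde{\mathbf{W}}_i]\,\tilde{\mathbf{W}}_i \Big[\sum_{i=1}^{N} \tilde{\mathbf{W}}_i'\tilde{\mathbf{W}}_i\Big]^{-1},$$


given independence of errors over $i$. Consistent estimation of
$\mathrm{V}[\hat{\boldsymbol{\theta}}_{OLS}]$ in this panel setting is analogous to the
cross-section problem of obtaining a consistent estimate of
$\mathrm{V}[\hat{\boldsymbol{\theta}}_{OLS}]$ that is robust to heteroskedasticity of unknown
form. The only complication is the appearance of a vector $\tilde{\mathbf{u}}_i$ rather than a
scalar $u_i$, which poses no problem if the panel is short as then the dimension of
$\tilde{\mathbf{u}}_i$ is finite.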

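This leads to a panel-robust estimate of the asymptotic variance matrix of the
pooled OLS estimator, one that controls for both serial correlation and
heteroskedasticity, given by


$$\hat{\mathrm{V}}[\hat{\boldsymbol{\theta}}_{OLS}] = \Big[\sum_{i=1}^{N} \tilde{\mathbf{W}}_i'\tilde{\mathbf{W}}_i\Big]^{-1} \sum_{i=1}^{N} \tilde{\mathbf{W}}_i'\hat{\tilde{\mathbf{u}}}_i\hat{\tilde{\mathbf{u}}}_i'\tilde{\mathbf{W}}_i \Big[\sum_{i=1}^{N} \tilde{\mathbf{W}}_i'\tilde{\mathbf{W}}_i\Big]^{-1}, \tag{21.13}$$


where $\hat{\tilde{\mathbf{u}}}_i = \tilde{\mathbf{y}}_i - \tilde{\mathbf{W}}_i\hat{\boldsymbol{\theta}}$.
The estimator in (21.13) assumes independence over $i$ and $N \to \infty$, the case for short
panels, but otherwise permits $\mathrm{V}[u_{it}]$ and $\mathrm{Cov}[u_{it}, u_{is}]$ to vary
with $i$, $t$, and $s$. An equivalent expression is


$$\hat{\mathrm{V}}[\hat{\boldsymbol{\theta}}_{OLS}] = \Big[\sum_{i=1}^{N}\sum_{t=1}^{T} \tilde{\mathbf{w}}_{it}\tilde{\mathbf{w}}_{it}'\Big]^{-1} \sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{s=1}^{T} \tilde{\mathbf{w}}_{it}\tilde{\mathbf{w}}_{is}'\hat{\tilde{u}}_{it}\hat{\tilde{u}}_{is} \Big[\sum_{i=1}^{N}\sum_{t=1}^{T} \tilde{\mathbf{w}}_{it}\tilde{\mathbf{w}}_{it}'\Big]^{-1},$$


where $\hat{\tilde{u}}_{it} = \tilde{y}_{it} - \tilde{\mathbf{w}}_{it}'\hat{\boldsymbol{\theta}}$.
This estimator was proposed by Arellano (1987) for the fixed effects estimator.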

Panel-robust standard errors based on (21.13) can be computed by use of a regular 
OLS command, if the command has a cluster-robust standard error option (see Sec- 
tion 24.5.2). Since the clustering here is on the individual one selects the identifier for 
individual $i$ as the cluster variable. This method was used to obtain the panel-robust
standard errors given in Table 24.1. 

The term “robust” standard error can cause confusion. A common error made in 
pooled regression is to estimate the OLS regression (21.12) using the standard robust 
standard error option (see Section 4.4.5). However, this only adjusts for heteroskedas- 
ticity, and in practice in a panel setting it is much more important to correct for the 
correlation in individual errors. Another common error, though one that has smaller 
impact, is to use cluster-robust standard errors that assume homoskedasticity so that 
$\mathrm{E}[\mathbf{u}_i\mathbf{u}_i']$ is constant over $i$.
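
The difference between a heteroskedasticity-only correction and the panel-robust formula (21.13) can be sketched for the scalar case $q = 1$, where the transformed regression has a single regressor and no intercept. The data below are hypothetical, with residuals of the same sign within each individual; the function names are illustrative and do not correspond to any package's API.

```python
def ols_no_intercept(w, y):
    """OLS of y_it on a single regressor w_it; returns (theta_hat, sum of w^2)."""
    num = sum(w_it * y_it for w_i, y_i in zip(w, y) for w_it, y_it in zip(w_i, y_i))
    sww = sum(w_it ** 2 for w_i in w for w_it in w_i)
    return num / sww, sww


def het_only_var(w, y):
    """White-type variance: treats every (i, t) observation as independent."""
    theta, sww = ols_no_intercept(w, y)
    meat = sum(
        (w_it * (y_it - theta * w_it)) ** 2
        for w_i, y_i in zip(w, y)
        for w_it, y_it in zip(w_i, y_i)
    )
    return meat / sww ** 2


def panel_robust_var(w, y):
    """Scalar version of (21.13): sum w_it * u_it within each individual, then square."""
    theta, sww = ols_no_intercept(w, y)
    meat = 0.0
    for w_i, y_i in zip(w, y):
        score_i = sum(w_it * (y_it - theta * w_it) for w_it, y_it in zip(w_i, y_i))
        meat += score_i ** 2
    return meat / sww ** 2


# Hypothetical data: N = 2, T = 2; residuals are positive for the first
# individual and negative for the second (positive within-individual correlation).
w = [[1.0, 2.0], [1.0, 2.0]]
y = [[3.0, 4.5], [1.0, 3.5]]

theta_hat, _ = ols_no_intercept(w, y)
print("theta_hat:", theta_hat)
print("heteroskedasticity-only variance:", het_only_var(w, y))
print("panel-robust variance:           ", panel_robust_var(w, y))
```

Here the panel-robust variance (about 0.08) is twice the heteroskedasticity-only variance (about 0.04), because the cross-period products of residuals within an individual are retained rather than dropped. With only two individuals the numbers have no inferential meaning; the example shows the mechanics only.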


Panel Bootstrap Standard Errors


The bootstrap method provides an alternative way to obtain panel-robust standard
errors. The key assumption is that observations are independent over $i$, so one does
a bootstrap pairs procedure that resamples with replacement over $i$ and uses all
observed time periods for a given individual. For data
$\{(\mathbf{y}_i, \mathbf{X}_i), i = 1, \ldots, N\}$ this yields $B$ pseudo-samples and for each
pseudo-sample one performs OLS regression of $\tilde{y}_{it}$ on $\tilde{\mathbf{w}}_{it}$,
yielding $B$ estimates $\hat{\boldsymbol{\theta}}_b$, $b = 1, \ldots, B$.

The panel bootstrap estimate of the variance matrix is then

$$\hat{\mathrm{V}}_{Boot}[\hat{\boldsymbol{\theta}}] = \frac{1}{B - 1} \sum_{b=1}^{B} (\hat{\boldsymbol{\theta}}_b - \bar{\hat{\boldsymbol{\theta}}})(\hat{\boldsymbol{\theta}}_b - \bar{\hat{\boldsymbol{\theta}}})', \tag{21.14}$$


where $\bar{\hat{\boldsymbol{\theta}}} = B^{-1} \sum_b \hat{\boldsymbol{\theta}}_b$. This
bootstrap provides no asymptotic refinement (see Section 11.2.2). Given independence over $i$
the estimate is consistent as $N \to \infty$. It is asymptotically equivalent to the estimate
(21.13), just as in the cross-section case bootstrap pairs are asymptotically equivalent to
White's heteroskedastic consistent estimate. This bootstrap does not offer an asymptotic
refinement though bootstrap with asymptotic refinement is possible (see Section 11.6.2).

This bootstrap method can be applied to any panel estimator that relies on 
independence over i and N — oo, including the pooled feasible GLS estimators of 
Section 21.5.2 for short panels. The key is to resample over i only, and not over both 
i and t. 
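
A minimal sketch of the panel bootstrap in (21.14) for a scalar parameter is given below. It resamples whole individuals with replacement and keeps every time period of each drawn individual together. The estimator, the data, the number of replications, and the seed are all illustrative assumptions.

```python
import random


def ols_no_intercept(w, y):
    """OLS slope from regression of y_it on a single regressor w_it, no intercept."""
    num = sum(w_it * y_it for w_i, y_i in zip(w, y) for w_it, y_it in zip(w_i, y_i))
    den = sum(w_it ** 2 for w_i in w for w_it in w_i)
    return num / den


def panel_bootstrap_var(w, y, reps=200, seed=0):
    """Bootstrap variance as in (21.14), resampling over individuals i only."""
    rng = random.Random(seed)
    n = len(w)
    estimates = []
    for _ in range(reps):
        # Draw N individuals with replacement; each draw brings all T periods.
        draw = [rng.randrange(n) for _ in range(n)]
        w_b = [w[i] for i in draw]
        y_b = [y[i] for i in draw]
        estimates.append(ols_no_intercept(w_b, y_b))
    mean = sum(estimates) / reps
    return sum((e - mean) ** 2 for e in estimates) / (reps - 1)


# Hypothetical data: N = 3 individuals, T = 2 periods.
w = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
y_noisy = [[3.0, 4.5], [1.0, 3.5], [2.0, 6.0]]
y_exact = [[2.0 * v for v in row] for row in w]  # y = 2w with no error

print("bootstrap variance, noisy data:", panel_bootstrap_var(w, y_noisy))
print("bootstrap variance, exact fit: ", panel_bootstrap_var(w, y_exact))
```

Resampling individual $(i, t)$ observations instead would break the within-individual dependence that the procedure is meant to preserve. In practice $N$ would be far larger than in this toy example.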


Discussion 


The importance of correcting standard errors for serial correlation in errors at the indi- 
vidual level cannot be overemphasized. Computer packages currently do not automat- 
ically do this. Bertrand, Duflo, and Mullainathan (2004) illustrate the resulting down- 
ward bias in standard error computation, in the context of difference-in-differences es- 
timation (see Section 22.6). They find that the panel-robust and panel bootstrap meth- 
ods work well, even though in their application with state-year data N (the number of 
states) is relatively small whereas the asymptotic theory uses $N \to \infty$.

The following example (see Table 21.2) also shows the importance of correcting
standard errors for any error serial correlation and heteroskedasticity.


21.3. Linear Panel Example: Hours and Wages 


An important issue in labor economics is the responsiveness of labor supply to wages. 
The standard textbook model of labor supply suggests that for people already working 
the effect of a wage increase on labor supply is ambiguous, with an income effect 
pushing in the direction of less work offsetting a substitution effect in the direction of 
more work. 

Cross-section analysis for adult males finds a relatively small positive response of
hours worked to wages. However, it is possible that this association is spurious, merely
reflecting a greater unobserved desire to work being positively associated with higher wages.
Panel data analysis can control for this, under the assumption that the unobserved desire
to work is time-invariant. For example, the within estimator does so by measuring
the extent to which an individual works above-average (or below-average) hours in
periods with above-average (or below-average) wages.

The data on 532 males for each of the 10 years from 1979 to 1988 come from Ziliak
(1997). The variable of interest is lnhrs, the natural logarithm of annual hours worked.
The single explanatory variable is lnwg, the natural logarithm of hourly wage. We
consider the regression model


$$\mathit{lnhrs}_{it} = \alpha_i + \beta\,\mathit{lnwg}_{it} + \varepsilon_{it},$$


where the individual-specific effect $\alpha_i$ is simplified to $\alpha$ in some models and
$\beta$ measures the wage elasticity of labor supply. The error term $\varepsilon_{it}$ is
assumed to be independent over $i$, but it may be correlated over $t$ for given $i$. As noted
we expect $\beta$, the labor supply elasticity, to be small and positive.

Ziliak (1997) additionally included a quadratic in age, number of children, and an
indicator variable for bad health. These regressors and year dummies make relatively
small difference to the estimate of $\beta$ and its standard error, and for simplicity they
are omitted here. In Chapter 22 we consider more general models that permit lnwg to be
endogenous and permit lags of lnhrs to appear as a regressor.


21.3.1. Data Summary 


For the 5,320 observations, the sample means of lnhrs and lnwg are respectively 7.66
and 2.61, implying geometric means of 2,120 hours and $13.60 per hour. The sample
standard deviations are respectively 0.29 and 0.43, indicating considerably greater
variability in percentage terms in wages than in hours.

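For panel data it is useful to know whether variability is mostly across individuals
or across time. The total variation of a series $x_{it}$ around its grand mean $\bar{x}$ can
be decomposed as


$$\begin{aligned}
\sum_{i=1}^{N}\sum_{t=1}^{T} (x_{it} - \bar{x})^2 &= \sum_{i=1}^{N}\sum_{t=1}^{T} [(x_{it} - \bar{x}_i) + (\bar{x}_i - \bar{x})]^2 \\
&= \sum_{i=1}^{N}\sum_{t=1}^{T} (x_{it} - \bar{x}_i)^2 + \sum_{i=1}^{N}\sum_{t=1}^{T} (\bar{x}_i - \bar{x})^2,
\end{aligned}$$


as the cross-product term sums to zero. In words, the total sum of squares equals
the within sum of squares plus the between sum of squares. This leads to within
standard deviation $s_W$ and between standard deviation $s_B$, where


$$s_W^2 = \frac{1}{NT - N} \sum_{i=1}^{N}\sum_{t=1}^{T} (x_{it} - \bar{x}_i)^2$$


and

$$s_B^2 = \frac{1}{N - 1} \sum_{i=1}^{N} (\bar{x}_i - \bar{x})^2.$$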
Table 21.2. Hours and Wages: Standard Linear Panel Model Estimators^a

                       POLS      Between   Within    First Diff  RE-GLS    RE-MLE
$\alpha$               7.442     7.483     7.220     .001        7.346     7.346
$\beta$                .083      .067      .168      .109        .119      .120
Robust se^b            (.030)    (.024)    (.085)    (.084)      (.051)    (.052)
Boot se                [.030]    [.019]    [.084]    [.083]      [.056]    [.058]
Default se             {.009}    {.020}    {.019}    {.021}      {.014}    {.014}
$R^2$                  .015      .021      .016      .008        .014      .014
RMSE                   .283      .177      .233      .296        .233      .233
RSS                    427.225   0.363     259.398   417.944     288.860   288.612
TSS                    433.831   17.015    263.677   420.223     293.023   292.773
$\sigma_\alpha$        .000                .181                  .161      .162
$\sigma_\varepsilon$   .283                .232                  .233      .233
$\lambda$              0.000     –         1.000     –           .585      .586
N                      5320      532       5320      4788        5320      5320

^a Shown are pooled OLS (POLS), between, within, first-differences, random effects (RE) GLS
and MLE linear panel regression of lnhrs on lnwg. Standard errors for the slope coefficients
are panel robust in parentheses, panel bootstrap in square brackets, and default estimates
that assume iid errors in curly braces. The $R^2$, root mean square error (RMSE), residual
sum of squares (RSS), total sum of squares (TSS), and sample size come from the appropriate
regression given in Section 21.2. The parameter $\lambda$ is defined after (21.11).

^b se, standard error.


The within and between sample standard deviations are, respectively, 0.22 and 0.18
for lnhrs and 0.19 and 0.39 for lnwg. The larger total variation in wages compared to
hours is therefore due to between-individual variation being much higher for wages.
Within individuals the variation is actually somewhat smaller for wages than it is for
hours.
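
The within and between standard deviations can be computed directly from a balanced panel stored as an N x T array. The following is a minimal sketch on simulated data (the data, the seed, and the function name `within_between_sd` are illustrative assumptions, not the Ziliak sample); it also checks numerically that the total sum of squares splits into within and between parts.

```python
import numpy as np

def within_between_sd(x):
    """Within and between standard deviations for a balanced N x T panel."""
    n, t = x.shape
    xbar_i = x.mean(axis=1, keepdims=True)        # individual means, N x 1
    xbar = x.mean()                               # grand mean
    within_ss = ((x - xbar_i) ** 2).sum()         # sum over i and t
    between_ss_i = ((xbar_i - xbar) ** 2).sum()   # sum over i only
    s_w = np.sqrt(within_ss / (n * t - n))
    s_b = np.sqrt(between_ss_i / (n - 1))
    return s_w, s_b

# Simulated log-wage-like series: large individual component, smaller
# time-varying component (illustrative values only).
rng = np.random.default_rng(0)
N, T = 532, 10
x = 2.6 + rng.normal(0.0, 0.4, size=(N, 1)) + rng.normal(0.0, 0.2, size=(N, T))

# Decomposition: total SS = within SS + between SS (summed over i and t).
total_ss = ((x - x.mean()) ** 2).sum()
within_ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum()
between_ss = T * ((x.mean(axis=1) - x.mean()) ** 2).sum()

s_w, s_b = within_between_sd(x)
print(f"s_W = {s_w:.3f}, s_B = {s_b:.3f}")
```

In this simulated series most of the variation is between individuals, which is the same qualitative pattern the text reports for lnwg.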


21.3.2. Comparison of Panel Data Estimators 


Table 21.2 summarizes results from application of the standard panel estimators de- 
fined in Section 21.2.2 to these data, along with three different estimates of the stan- 
dard errors. As detailed in the following, statistical inference should use either the 
panel-robust standard error or the panel bootstrap standard error. 


Slope Parameter Estimates 


The estimate of the slope parameter $\beta$ differs across the different estimation methods.
The between estimate that uses only cross-section variation is less than the pooled OLS
estimate. The within or fixed effects estimate of 0.168 is much higher than the pooled
OLS estimate of 0.083 and is borderline statistically significant using a two-tailed test
at 5% and standard error estimate of 0.084 or 0.085. The first-differences estimate of
0.109 is also higher than that of pooled OLS but is considerably less than the within
estimate, which also uses only time-series variation. The RE estimates of 0.119 or
0.120 lie between the between and within estimates. This is expected, as RE estimates
can be shown to be a weighted average of between and within estimates. The two
RE estimates are very close to each other as here the estimates of the variances $\sigma_\alpha^2$ and
$\sigma_\varepsilon^2$ are similar, leading to similar values $\hat{\lambda} = 0.585$ and $\hat{\lambda} = 0.586$ in the regression
(21.10). The RE estimates are surprisingly less efficient than the pooled OLS estimates,
a sign that the RE model fails to model the error correlation well.

Which estimates are preferred? The within and first-difference estimators are con- 
sistent under all models (pooled, RE, and FE) whereas the other estimators are in- 
consistent under the fixed effects model. The most robust estimates are therefore the 
within or first-differences estimates of 0.168 or 0.109. 
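
To make the comparison of estimators concrete, the sketch below computes pooled OLS, between, within, first-differences, and RE slope estimates for a single regressor on simulated data in which the individual effect is correlated with the regressor. Everything here is an assumption for illustration: the data-generating process, the helper `ols_slope`, and the use of the true variance components in the quasi-demeaning weight, taken to be the standard $\lambda = 1 - \sigma_\varepsilon/\sqrt{T\sigma_\alpha^2 + \sigma_\varepsilon^2}$ and assumed to match the definition given after (21.11).

```python
import numpy as np

def ols_slope(x, y):
    """Slope from OLS of y on an intercept and one regressor x (1-D arrays)."""
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / (xc * xc).sum())

rng = np.random.default_rng(1)
N, T, beta = 500, 10, 0.1
alpha = rng.normal(size=(N, 1))                    # individual effect
x = 0.5 * alpha + rng.normal(size=(N, T))          # regressor correlated with alpha
y = beta * x + alpha + rng.normal(size=(N, T))

xbar_i = x.mean(axis=1, keepdims=True)
ybar_i = y.mean(axis=1, keepdims=True)

b_pols = ols_slope(x.ravel(), y.ravel())                          # pooled OLS
b_between = ols_slope(xbar_i.ravel(), ybar_i.ravel())             # individual means
b_within = ols_slope((x - xbar_i).ravel(), (y - ybar_i).ravel())  # mean deviations
b_fd = ols_slope(np.diff(x, axis=1).ravel(), np.diff(y, axis=1).ravel())

# RE by OLS on quasi-demeaned data, using the true variance components
# (in practice these would be estimated).
sigma_a, sigma_e = 1.0, 1.0
lam = 1.0 - sigma_e / np.sqrt(T * sigma_a**2 + sigma_e**2)
b_re = ols_slope((x - lam * xbar_i).ravel(), (y - lam * ybar_i).ravel())

print(b_pols, b_between, b_within, b_fd, b_re)
```

With one regressor the RE slope is a weighted average of the between and within slopes, so it always falls between them; only the within and first-differences estimates stay close to the true value of 0.1 under this fixed effects design.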

There is, however, an efficiency loss in using these more robust estimators, with
standard errors of 0.083 to 0.085 that are much larger than those from pooled OLS and
RE estimates. A formal Hausman test (see Section 21.4.3 for details and discussion)
can be used to test whether or not the individual effects are fixed. Given the relative
imprecision of estimation in this example, the Hausman test does not reject the null
hypothesis of random effects, despite the large difference between FE and RE estimates.
So the more efficient random effects estimates could be used here. Another
advantage of random effects estimation is that it permits estimation of the coefficients
of time-invariant regressors.


Standard Error Estimation 


We now turn to comparison of the standard error estimates. From Section 21.2.3, in- 
ference should be based on panel-robust standard errors that permit errors to be corre- 
lated over time for a given individual and to have variances and covariances that differ 
across individuals. Also, as detailed in later sections, the standard errors for estimators 
based on deviations from means, such as (21.8) and (21.10), need to account for loss 
of N + K rather than K degrees of freedom. 

The first standard error estimate is computed by the panel-robust method given in
(21.13), and the second is computed by the panel bootstrap given in (21.14) with 500
replications. For brevity these estimates are called panel robust, though they are
additionally robust to heteroskedasticity. The two estimates are very close, aside from the
random effects models where the panel-robust standard errors are underestimated
because they are computed for the regression (21.10), which ignores estimation error in $\hat{\lambda}$.

The third standard error estimate is the standard default computer output that is 
based on the assumption of iid errors. In this example the correctly estimated standard 
errors are a remarkable three to four times as large as the default standard errors. The 
one exception is the between estimator, an estimator with standard errors that need 
only correction for heteroskedasticity since it uses only cross-section variation. 

For example, for the pooled OLS estimator of $\beta$ the default standard error is 0.009,
leading to an incorrect t-statistic of 9.07. The panel-robust standard error is a much
larger 0.030, leading to a correct t-statistic of a much smaller 2.83. Default standard
errors assume independence of model errors over $t$ for given $i$ when in practice they
are likely to be positively correlated. This erroneous assumption overestimates the
benefit of additional time periods, leading to downward bias in standard errors (see
Section 21.5.4). Additionally, ignoring heteroskedasticity in errors also leads to bias,
though this bias could be in either direction. For these data a failure to control for
heteroskedasticity also imparts a large downward bias: The standard error of $\hat{\beta}_{POLS}$
controlling for heteroskedasticity, but not for correlation over $t$ for given $i$, is 0.020.
For other data, correction for heteroskedasticity is usually much less important than
the correction for panel correlation.

For the within and between estimators inclusion of the term $\alpha_i$ should control for
some of the correlation in the error across time for a given individual. For these data,
however, the differences between panel-robust and nonrobust standard errors remain
large, in part because of failure to additionally control for heteroskedasticity.

Clearly panel-robust standard errors should be used. 
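
The gap between default and panel-robust standard errors is easy to reproduce by simulation. The sketch below is a hedged illustration, not the exact estimator of (21.13): it computes a cluster-robust sandwich variance for pooled OLS with clustering on the individual and no finite-sample correction. The function name `pooled_ols_se` and the simulated design (regressor and error both containing a persistent individual component) are assumptions.

```python
import numpy as np

def pooled_ols_se(x, y):
    """Pooled OLS of y on an intercept and x (both N x T arrays).

    Returns the slope, the default (iid) standard error, and a
    cluster-robust standard error with clustering on the individual.
    """
    n, t = x.shape
    W = np.column_stack([np.ones(n * t), x.ravel()])   # rows ordered by i, then t
    yv = y.ravel()
    WtW_inv = np.linalg.inv(W.T @ W)
    coef = WtW_inv @ W.T @ yv
    u = yv - W @ coef

    # Default variance: s^2 (W'W)^{-1}, valid only for iid errors.
    s2 = (u @ u) / (n * t - W.shape[1])
    v_default = s2 * WtW_inv

    # Cluster-robust variance: (W'W)^{-1} [sum_i W_i'u_i u_i'W_i] (W'W)^{-1}.
    scores = np.einsum("itk,it->ik", W.reshape(n, t, -1), u.reshape(n, t))
    v_robust = WtW_inv @ (scores.T @ scores) @ WtW_inv

    return coef[1], np.sqrt(v_default[1, 1]), np.sqrt(v_robust[1, 1])

rng = np.random.default_rng(2)
N, T = 500, 10
x = rng.normal(size=(N, 1)) + rng.normal(size=(N, T))   # persistent regressor
u = rng.normal(size=(N, 1)) + rng.normal(size=(N, T))   # equicorrelated error
y = 7.4 + 0.1 * x + u

slope, se_default, se_cluster = pooled_ols_se(x, y)
print(slope, se_default, se_cluster)
```

Because both the regressor and the error are positively correlated over time within an individual, the cluster-robust standard error is noticeably larger than the default one, which is the direction of bias described in the text.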


21.3.3. Graphical Analysis 


It is insightful to perform a graphical comparison of overall, between, and fixed effects 
(within or first-differences) regressions. Such plots are rarely performed in panel data 
regression, but they are easily applied here as there is only one regressor. 

All plots include a nonparametric regression curve using the Lowess smoother (see 
Section 9.6.2) and a linear regression curve that corresponds to the estimates given in 
Table 21.2. 

Figure 21.1 plots lnhrs against lnwg for all individuals in all years (5,320 observations).
The plot suggests a positive relationship, roughly linear except at the extreme ends,
and from Table 21.2 the line has slope 0.083 with a low $R^2 = 0.015$.

The between estimator (21.7) regresses $\bar{y}_i$ on $\bar{x}_i$. The corresponding plot for the
lnhrs–lnwg data is given in Figure 21.2 and again shows a positive relationship.

The within or fixed effects estimator (21.8) regresses $(y_{it} - \bar{y}_i)$ on $(x_{it} - \bar{x}_i)$.
Figure 21.3 gives the related plot of $(y_{it} - \bar{y}_i + \bar{y})$ on $(x_{it} - \bar{x}_i + \bar{x})$, where
$\bar{y} = N^{-1}\sum_i \bar{y}_i$ and $\bar{x} = N^{-1}\sum_i \bar{x}_i$ are the grand means of $y$ and $x$. Comparison with
Figure 21.1 shows that differencing the individual mean leads to a considerable decrease
in the range of variability in lnwg, with less of a decrease in the variability of lnhrs.


Figure 21.1: Hours and wages: pooled (overall) regression. Natural logarithm of annual
hours worked plotted against natural logarithm of hourly wage, showing the original data
and a nonparametric fit. Data for 532 U.S. males for each of the ten years 1979-88.

Figure 21.2: Hours and wages: between regression. Ten-year average of log hours plotted
against ten-year average of log wage for 532 men. Same sample as Figure 21.1.


The slope does appear steeper than that for pooled OLS, and from Table 21.2 the slope 
increased from 0.083 to 0.168. 

The first-differences estimator (21.9) regresses $(y_{it} - y_{i,t-1})$ on $(x_{it} - x_{i,t-1})$. The
corresponding plot for the lnhrs–lnwg data is given in Figure 21.4. The figure is
qualitatively similar to Figure 21.3.

The conclusion of the preceding analysis is that there is greater response to wage 
changes using time-series variation than using cross-section variation. 


21.3.4. Residual Analysis 


It is instructive to consider the autocorrelation patterns of the data and of residuals. For
example, for residuals $\hat{u}_{it} = y_{it} - \hat{y}_{it}$, the autocorrelation between period $s$ and period
$t$ is calculated as $\hat{\rho}_{st} = c_{st}/\sqrt{c_{ss}c_{tt}}$, $s, t = 1, \ldots, T$, where the covariance estimate
$c_{st} = (N-1)^{-1}\sum_i (\hat{u}_{is} - \bar{u}_s)(\hat{u}_{it} - \bar{u}_t)$ and $\bar{u}_t = N^{-1}\sum_i \hat{u}_{it}$.


Figure 21.3: Hours and wages: within (fixed effects) regression. Deviation of log hours
from ten-year average plotted against deviation of log wage from ten-year average using
ten years of data for 532 men. Same sample as Figure 21.1.

Figure 21.4: Hours and wages: first differences regression. First difference of log hours
plotted against first difference of log wage using ten years of data for 532 men. Same
sample as Figure 21.1.


Table 21.3 gives the residual autocorrelations after pooled OLS regression of lnhrs
on lnwg. The autocorrelations generally lie between 0.2 and 0.4 for data two to nine
periods apart. The decay rate is very slow, and the autocorrelations appear closer to a
random effects model that assumes that $\mathrm{Cor}[u_{it}, u_{is}]$ is constant for $t \neq s$ than to an
AR(1) model that has exponential decay.

The autocorrelations for lnhrs before regression are very similar to those given in
Table 21.3, since $\hat{u}_{it} \simeq y_{it}$ as evident from the poor explanatory power of pooled OLS
with $R^2 = 0.015$. The autocorrelations for the regressor lnwg, not tabulated here, are
much higher, ranging from approximately 0.9 at one lag, to 0.7 at nine lags.

The correlations of the residuals from the within regression are given in Table 21.4.
If the original errors $\varepsilon_{it}$ in (21.3) are iid then it can be shown that the transformed
errors $\varepsilon_{it} - \bar{\varepsilon}_i$ have autocorrelations at all lags equal to $-1/(T-1) = -0.11$. There
is some departure from this here, particularly for the first lag, which is always positive.


Table 21.3. Hours and Wages: Autocorrelations of Pooled OLS Residuals^a


           u79    u80    u81    u82    u83    u84    u85    u86    u87    u88

upols79   1.00
upols80    .33   1.00
upols81    .44    .40   1.00
upols82    .30    .31    .57   1.00
upols83    .21    .23    .37    .47   1.00
upols84    .20    .23    .32    .34    .64   1.00
upols85    .24    .32    .41    .35    .39    .58   1.00
upols86    .20    .19    .28    .25    .31    .35    .40   1.00
upols87    .20    .32    .33    .29    .31    .34    .39    .35   1.00
upols88    .16    .25    .30    .26    .21    .25    .34    .55    .53   1.00


a Note: Autocorrelations of residuals are from pooled OLS regression of lnhrs on lnwg for 532 men in 10 years.
The autocorrelations die slowly.
Table 21.4. Hours and Wages: Autocorrelations of Within Regression Residuals^a


           u79    u80    u81    u82    u83    u84    u85    u86    u87    u88

ufe79     1.00
ufe80      .10   1.00
ufe81      .21    .08   1.00
ufe82      .00   −.04    .26   1.00
ufe83      .26    .27    .21    .01   1.00
ufe84      .26    .27    .30    .20    .32   1.00
ufe85      .18    .10    .11    .17    .16      7   1.00
ufe86      .19    .25    .26    .27    .17    .14    .08   1.00
ufe87      .15    .05    .16    .20    .24    .21    .09    .09   1.00
ufe88      .17    .11    .14    .18    .38    .31    .13    .24    .24   1.00


a Autocorrelations of residuals are from within (fixed effects) regression of lnhrs on lnwg for 532 men in 10
years.


The correlations of the residuals from random effects regression are quite similar
to those for fixed effects given in Table 21.4. The correlations of residuals from
first-differences regression are qualitatively similar to the theoretical result that if the
original errors $\varepsilon_{it}$ in (21.3) are iid then the transformed errors $\varepsilon_{it} - \varepsilon_{i,t-1}$ have
autocorrelations of $-0.5$ at lag one and 0 at other lags.
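
The theoretical autocorrelations of transformed iid errors can be checked by simulation. The sketch below (simulated standard normal errors, an arbitrary seed, and a large N are all assumptions) applies the within and first-differences transformations and computes the cross-period correlation matrices in the same way as the residual autocorrelations defined in this section.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 20000, 10
eps = rng.normal(size=(N, T))                      # iid errors

within = eps - eps.mean(axis=1, keepdims=True)     # eps_it - mean_i(eps)
fd = np.diff(eps, axis=1)                          # eps_it - eps_{i,t-1}

# Correlations across periods, with individuals as observations.
corr_within = np.corrcoef(within, rowvar=False)    # T x T
corr_fd = np.corrcoef(fd, rowvar=False)            # (T-1) x (T-1)

off_diag_within = corr_within[~np.eye(T, dtype=bool)]
lag1_fd = np.diag(corr_fd, k=1)
lag2_fd = np.diag(corr_fd, k=2)

print("within, all lags:", off_diag_within.mean(), "theory:", -1 / (T - 1))
print("first diff, lag 1:", lag1_fd.mean())
print("first diff, lag 2:", lag2_fd.mean())
```

The within-transformed errors are equicorrelated at about $-1/(T-1)$, while first differences of iid errors are negatively correlated at lag one (near $-0.5$) and uncorrelated at longer lags.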


21.4. Fixed Effects versus Random Effects Models 


The fixed effects model has the attraction of allowing one to use panel data to establish 
causation under weaker assumptions (presented in Section 21.4.1) than those needed 
to establish causation with cross-section data or with panel data models without fixed 
effects, such as pooled models and random effects models. 

In some studies causation is clear, so random effects may be appropriate. For exam- 
ple, in a controlled experiment such as crop yield from different amounts of fertilizers 
applied to different fields the causation is clear. In other cases it may be sufficient to 
use a random effects analysis to measure the extent of correlation, with determination 
of causation left to further research taking other approaches. The effect of smoking on 
lung cancer is an example. Economists are unusual in instead preferring a fixed effects 
approach, however, because of a desire to measure causation in spite of reliance on 
observational data. 

The fixed effects model has several practical weaknesses. Estimation of the coeffi- 
cient of any time-invariant regressor, such as an indicator variable for gender, is not 
possible as it is absorbed into the individual-specific effect. Coefficients of time- 
varying regressors are estimable, but these estimates may be very imprecise if most 
of the variation in a regressor is cross sectional rather than over time. Prediction of the 
conditional mean is not possible. Instead, only changes in the conditional mean caused 
by changes in time-varying regressors can be predicted. Even coefficients of
time-varying regressors may be difficult or theoretically impossible to identify in nonlinear
models with fixed effects (see Chapter 23). For these reasons economists also use
random effects models, even if causal interpretation may then be unwarranted.


21.4.1. Fixed Effects Example 


Consider the effect of computer use on wage. Several cross-sectional studies, most no- 
tably those by Krueger (1993) and DiNardo and Pischke (1997), find that computer use 
in a job is associated with substantially higher wages, even after controlling for many 
determinants of the wage such as education, age, gender, industry, and occupation. As 
emphasized by DiNardo and Pischke (1997) this does not necessarily imply causa- 
tion, if regressors are correlated with the error term owing to endogeneity or omitted 
variables. 
Specifically, we suppose that in the cross section

$$y_i = x_i'\beta + \alpha_i + \varepsilon_i,$$

where $y$ is the natural logarithm of wage, $x$ is a vector of individual characteristics
including an indicator variable for computer use at work, and $\varepsilon$ is an error that is
assumed to be independent of $x$. The complication is the addition of the unobserved
variable $\alpha$, which is assumed to be correlated with computer use at work, and hence
with the observed regressors $x$, even though the components of $x$ other than computer
use, such as occupation and education, may partly control for computer use at work.
Regression of $y$ on $x$ leads to omitted variables bias leading to inconsistent estimates
of $\beta$ as the combined error $(\alpha + \varepsilon)$ is correlated with $x$.

Panel data offer a way around this problem, if we assume that the unobserved variable
$\alpha_i$ is time-invariant. Then

$$y_{it} = x_{it}'\beta + \alpha_i + \varepsilon_{it},$$

where again $\varepsilon$ is uncorrelated with $x$ and $\alpha$ is correlated with $x$. Differencing eliminates
$\alpha_i$ (see Section 21.2.2), permitting consistent estimation of $\beta$. For the computer use
example, the causative effect of computer use on wages is then measured by the
association between individual changes in wages and individual movements to or from a job
with a computer. Haisken-DeNew and Schmidt (1999) found no effect using German
panel data.

This fixed effects panel approach permits determination of causation under weaker
assumptions than those of cross-section analysis, but it still requires assumptions. The
key assumption is that the unobservables $\alpha_i$ are time-invariant, rather than being of
more general form $\alpha_{it}$. In the computer use example it is being assumed that an
individual's propensity to have a job with a computer may be endogenous, but the
unobserved component of the effect of this propensity $\alpha_i$ on wage is constant over time
once we control for observables $x_{it}$. Essentially the particular time periods in which an
individual's job does or does not involve use of a computer are assumed to be purely
random, once we control for time-invariant unobservable $\alpha_i$ and observable $x_{it}$.

A random effects or pooled panel approach does not have similar properties. It
instead assumes away the original concern that $\alpha$ is correlated with $x$, since it
assumes that $\alpha$ is iid $[0, \sigma_\alpha^2]$ and hence is uncorrelated with $x$. This leads to inconsistent
parameter estimates if in fact $\alpha$ is correlated with $x$, whereas the fixed effects estimator
is consistent if $\alpha$ is correlated with $x$, provided $\alpha$ is time-invariant.
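
A small simulation in the spirit of the computer use example illustrates the argument. In the sketch below the true effect of the indicator on the outcome is set to zero, but the probability that the indicator equals one rises with a time-invariant unobservable. All names and parameter values are hypothetical and chosen only for illustration; no real wage data are used.

```python
import numpy as np

def slope(x, y):
    """OLS slope of y on an intercept and a single regressor (1-D arrays)."""
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / (xc * xc).sum())

rng = np.random.default_rng(4)
N, T = 2000, 5
true_effect = 0.0

alpha = rng.normal(size=(N, 1))                     # time-invariant unobservable
p_use = 1.0 / (1.0 + np.exp(-alpha))                # use is more likely when alpha is high
d = (rng.uniform(size=(N, T)) < p_use).astype(float)
y = true_effect * d + alpha + 0.5 * rng.normal(size=(N, T))

# Cross-section regression in the first period: picks up the correlation
# between d and alpha, so it suggests a spurious positive effect.
b_cross = slope(d[:, 0], y[:, 0])

# Within regression: deviations from individual means remove alpha.
d_dev = (d - d.mean(axis=1, keepdims=True)).ravel()
y_dev = (y - y.mean(axis=1, keepdims=True)).ravel()
b_within = slope(d_dev, y_dev)

print(b_cross, b_within)
```

The cross-section slope is well above zero even though the true effect is zero, while the within slope is close to zero, because identification comes only from individuals whose indicator changes over time.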


21.4.2. Conditional versus Marginal Analysis 


Fixed effects estimation is a conditional analysis, measuring the effect of $x_{it}$ on $y_{it}$
controlling for the individual effect $\alpha_i$. Prediction is possible only for individuals in
the particular sample being used, and even then it is only possible if the panel is long
enough so that $\alpha_i$ can be consistently estimated. Random effects estimation is instead
an example of marginal analysis or population-averaged analysis, as the individual
effects are integrated out as iid random variables. The random effects estimators can
be applied outside the sample.

If the true model is a random effects model, then whether to perform a conditional 
or marginal analysis will vary with the application. If analysis is for a random sample 
of countries then one uses random effects, but if one is intrinsically interested in the 
particular countries in the sample then one does fixed effects estimation even though 
this can entail a loss of efficiency. 

If the true model instead has individual-specific effects correlated with regressors, 
however, then a random effects analysis is no longer meaningful as the random effects 
estimator is inconsistent. Instead, alternative estimators such as the fixed effects and 
first-differences estimators are necessary. Because of the desire to determine causation 
microeconomic applications emphasize these latter estimators. 


21.4.3. Hausman Test 


If individual effects are fixed the within estimator $\hat{\beta}_W$ is consistent whereas the random
effects estimator $\hat{\beta}_{RE}$ is inconsistent. Here $\beta$ refers to the vector of coefficients of just
the time-varying regressors. One can therefore test whether fixed effects are present by
using a Hausman test of whether there is a statistically significant difference between
these estimators. Alternatively, any other pair of estimators with similar properties,
such as first differences versus pooled OLS, can be used.

A large value of the Hausman test statistic leads to rejection of the null hypothesis 
that the individual-specific effects are uncorrelated with regressors and to the conclu- 
sion that fixed effects are present. It may still be possible to avoid using a fixed effects 
model. If regressors are correlated with individual-specific effects caused by omitted 
variables, then one can add further regressors, either time varying or time-invariant, 
and again perform a Hausman test in this larger model to see whether fixed effects are 
still necessary. Even if such correlation persists it may be possible to estimate a random 
effects model using instrumental variables methods (see Sections 22.4.3—22.4.4). 


Computation When RE Is Fully Efficient 


We begin by assuming that the true model is the random effects model (21.3) with $\alpha_i$
iid $[0, \sigma_\alpha^2]$ uncorrelated with regressors and error $\varepsilon_{it}$ iid $[0, \sigma_\varepsilon^2]$.

Then the estimator $\hat{\beta}_{RE}$ is fully efficient, so from Section 8.3 the Hausman test
statistic simplifies to

$$H = (\hat{\beta}_{1,RE} - \hat{\beta}_{1,W})'\left[\hat{V}[\hat{\beta}_{1,W}] - \hat{V}[\hat{\beta}_{1,RE}]\right]^{-1}(\hat{\beta}_{1,RE} - \hat{\beta}_{1,W}),$$

where $\beta_1$ denotes the subcomponent of $\beta$ corresponding to time-varying regressors
since only that component can be estimated by the within estimator. This test statistic
is asymptotically $\chi^2(\dim[\beta_1])$ distributed under the null hypothesis.

Hausman (1978) showed that an asymptotically equivalent version of this test is to
perform a Wald test of $\gamma = 0$ in the auxiliary OLS regression,

$$y_{it} - \hat{\lambda}\bar{y}_i = (1-\hat{\lambda})\mu + (x_{1it} - \hat{\lambda}\bar{x}_{1i})'\beta_1 + (x_{1it} - \bar{x}_{1i})'\gamma + v_{it}, \qquad (21.15)$$

where $x_{1it}$ denotes the time-varying regressors and $\hat{\lambda}$ is defined in (21.11) and only
the time-varying regressors are used. This algebraic result can be interpreted as follows.
The individual-specific effects model (21.10) implies that $v_{it} = (1-\hat{\lambda})\alpha_i +
(\varepsilon_{it} - \hat{\lambda}\bar{\varepsilon}_i)$. The random effects estimator is actually obtained by OLS estimation of
(21.15) with $\gamma = 0$ (see (21.10)). If instead the fixed effects specification is valid then
the error $v_{it}$ will be correlated with the regressors, via correlation of $\alpha_i$ with regressors.
This correlation leads to additional functions of the regressors, such as $(x_{it} - \bar{x}_i)$,
being statistically significant variables in (21.15).


Computation When RE Is Not Fully Efficient 


The simple form of the Hausman test is invalid if $\alpha_i$ or $\varepsilon_{it}$ are not iid, which is
more than likely given heteroskedasticity inherent in much microeconometrics data.
Then the RE estimator is not fully efficient under the null hypothesis so the expression
$\hat{V}[\hat{\beta}_{1,W}] - \hat{V}[\hat{\beta}_{1,RE}]$ in the formula for $H$ needs to be replaced by the more general
$\hat{V}[\hat{\beta}_{1,RE} - \hat{\beta}_{1,W}]$ (see Section 8.3).

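For short panels this variance matrix can be consistently estimated by bootstrap
resampling over $i$ (see Section 21.2.3). Then a panel-robust Hausman test statistic is

$$H_{Robust} = (\hat{\beta}_{1,RE} - \hat{\beta}_{1,W})'\left[\hat{V}_{Boot}[\hat{\beta}_{1,RE} - \hat{\beta}_{1,W}]\right]^{-1}(\hat{\beta}_{1,RE} - \hat{\beta}_{1,W}), \qquad (21.16)$$

where

$$\hat{V}_{Boot}[\hat{\beta}_{1,RE} - \hat{\beta}_{1,W}] = \frac{1}{B-1}\sum_{b=1}^{B}(\hat{\delta}_b - \bar{\hat{\delta}})(\hat{\delta}_b - \bar{\hat{\delta}})',$$

$b$ denotes the $b$th of $B$ bootstrap replications (see Section 21.2.3), $\bar{\hat{\delta}}$ is the average of
the $\hat{\delta}_b$, and $\hat{\delta} = \hat{\beta}_{1,RE} - \hat{\beta}_{1,W}$. This test statistic can be applied to subcomponents of
$\beta_1$, and can use alternative estimators such as $\hat{\beta}_{1,POLS}$ in place of $\hat{\beta}_{1,RE}$ and
$\hat{\beta}_{1,FD}$ in place of $\hat{\beta}_{1,W}$.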

Alternatively, Wooldridge (2002) suggests estimating the auxiliary OLS regression
(21.15) and testing $\gamma = 0$ using panel-robust standard errors. If the effects are random,
though not necessarily such that $\alpha_i$ and $\varepsilon_{it}$ are iid, then $v_{it} = (1-\hat{\lambda})\alpha_i + (\varepsilon_{it} - \hat{\lambda}\bar{\varepsilon}_i)$ is
still uncorrelated with regressors though $v_{it}$ is no longer asymptotically iid, so
cluster-robust standard errors need to be used. If the effects are fixed then the error $v_{it}$ is
correlated with the regressors, leading to significance of additional functions of the
regressors such as $(x_{it} - \bar{x}_i)$. This robust version of the auxiliary regression for the
Hausman test is preferred to one that assumes $v_{it}$ is asymptotically iid, on the usual
grounds of minimizing distributional assumptions. However, it is not clear whether
this test actually coincides with the Hausman test when RE is inefficient.


Hausman Test Example 


For the lnhrs–lnwg example estimates given in Table 21.2, a comparison of FE
and RE estimates using the default standard errors yields $H \approx (0.168 - 0.119)^2/
(0.019^2 - 0.014^2)$. This leads to $H = 14 > \chi^2_{.05}(1) = 3.84$, so the random effects
model is rejected.

This test is not appropriate, however. The statistic H is inflated because the usual 
standard errors in this example are greatly downward biased (see Section 21.3.2). Fur- 
thermore, this bias is a signal that the RE estimator is not fully efficient under Ho, so 
that the more general form of the Hausman test needs to be used. 

The auxiliary regression (21.15) yields a panel-robust t-statistic for $\gamma$ of 1.28 and
hence $H^* = 1.28^2 = 1.65$, leading to nonrejection of the random effects model at 5%.
Even though the wage elasticity estimates differ by 0.049, the estimates are sufficiently
imprecise that the difference is not statistically significant. Note that if the nonrobust
t-statistic for $\gamma$ is used instead, then $t^2 = 13.69$, close to the previous incorrect Hausman
test statistic.
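
The arithmetic of this example can be reproduced from the rounded entries of Table 21.2. The sketch below is only a check of the scalar formulas; the variable names are illustrative, and because the inputs are rounded the results differ slightly from the figures quoted in the text, which were presumably computed from unrounded estimates.

```python
# Inputs taken from Table 21.2 and the text (rounded values).
b_fe, b_re = 0.168, 0.119          # within and RE-GLS slope estimates
se_fe, se_re = 0.019, 0.014        # default standard errors
crit_5pct = 3.84                   # chi-squared(1) critical value quoted in the text

# Simple Hausman statistic for a single coefficient:
# H = (b_FE - b_RE)^2 / (V[b_FE] - V[b_RE]).
h_default = (b_fe - b_re) ** 2 / (se_fe ** 2 - se_re ** 2)

# Robust version from the auxiliary regression: square of the
# panel-robust t-statistic on gamma reported in the text.
t_robust = 1.28
h_robust = t_robust ** 2

print(f"default H = {h_default:.2f}, reject RE: {h_default > crit_5pct}")
print(f"robust  H = {h_robust:.2f}, reject RE: {h_robust > crit_5pct}")
```

With the rounded inputs the default statistic is about 14.6 and the robust statistic about 1.64, so the two versions of the test lead to opposite conclusions, as the text describes.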


21.4.4. Richer Models for Random Effects 


The random effects model specifies that the random effect $\alpha_i$ is distributed
independently of regressors. Richer models, closer in spirit to fixed effects models, relax this
assumption.

Mundlak (1978) allowed individual effects in the panel model (21.3) to be determined
by time averages of the regressors, so that $\alpha_i = \bar{x}_i'\pi + \omega_i$, where $\omega_i$ is iid.
Then efficient GLS estimation of $\beta$ and $\pi$ in this expanded model leads to an estimator
of $\beta$ that equals the fixed effects estimator in model (21.3). By contrast the usual
random effects estimator of $\beta$ in model (21.3) that erroneously specifies iid random
effects will be inconsistent.
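
The Mundlak result can be verified numerically. The sketch below is a simplification under stated assumptions: it uses pooled OLS rather than GLS on the augmented regression of $y_{it}$ on an intercept, $x_{it}$, and the individual mean $\bar{x}_i$, with a single regressor, a balanced panel, and simulated data. In that setting the coefficient on $x_{it}$ coincides with the within estimate up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(5)
N, T = 300, 6
x = rng.normal(size=(N, 1)) + rng.normal(size=(N, T))
xbar_i = x.mean(axis=1, keepdims=True)

# Individual effect depends on the time average of the regressor.
alpha = 0.8 * xbar_i + rng.normal(size=(N, 1))
y = 0.1 * x + alpha + rng.normal(size=(N, T))

# Augmented regression: y_it on [1, x_it, xbar_i], rows ordered by i then t.
X_aug = np.column_stack([
    np.ones(N * T),
    x.ravel(),
    np.repeat(xbar_i.ravel(), T),
])
coef, *_ = np.linalg.lstsq(X_aug, y.ravel(), rcond=None)
b_mundlak = coef[1]

# Within (fixed effects) estimate from deviations from individual means.
x_dev = (x - xbar_i).ravel()
y_dev = (y - y.mean(axis=1, keepdims=True)).ravel()
b_within = (x_dev @ y_dev) / (x_dev @ x_dev)

print(b_mundlak, b_within)
```

The equality follows because, once the intercept and $\bar{x}_i$ are partialled out of $x_{it}$, what remains is exactly the deviation $x_{it} - \bar{x}_i$ used by the within estimator.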

Chamberlain (1982, 1984) considered an even richer model for the random effects,
with $\alpha_i = x_{i1}'\pi_1 + \cdots + x_{iT}'\pi_T + \omega_i$, a weighted sum of the regressors. He proposed
estimation by minimum distance methods (see Section 22.2.7 for details), leading to
an estimator of $\beta$ that equals the fixed effects estimator.

More generally, mixed linear models and hierarchical linear models of Section 24.6 
permit quite general models for random intercepts and also random slope parameters. 
Bayesian analysis of panel data also uses this framework. See Section 22.8 for details. 

In linear models the fixed effects approach is used if the unobserved individual 
effect is correlated with regressors. In more complicated models, such as nonlinear 
models, fixed effects models are not always estimable and richer random effects mod- 
els provide an alternative approach. 
21.5. Pooled Models


The pooled cross-section time-series model or constant-coefficients model is

$$y_{it} = \alpha + x_{it}'\beta + u_{it}. \qquad (21.17)$$

In the statistics literature the model is called a population-averaged model, as there
is no explicit model of $y_{it}$ conditional on individual effects. Instead, any individual
effects have implicitly been averaged out. The random effects model is a special case
where the error $u_{it}$ is equicorrelated over $t$ for given $i$ (see Section 21.2.1).

The main complication for statistical inference, assuming no fixed effects, is that
the distribution of least-squares estimators of this model varies with the assumed
distribution of $u_{it}$. In short panels, panel-robust standard errors can be obtained using
(21.13).

Here we instead focus on GLS estimation using many of the different specifications,
including equicorrelation, for the covariance structure of $u_{it}$ over time and individuals
that have been proposed in the literature.

Although we focus on pooled GLS estimation of (21.17), a model without 
individual-specific fixed effects, the methods of this section can be applied more gen- 
erally to pooled GLS estimation of the transformed model (21.12) of Section 21.2.3. 


21.5.1. Pooled OLS, FGLS, and WLS Estimators 


It is convenient to use matrix notation. Combining observations over time for a given
individual, define

$$y_i = W_i\delta + u_i, \qquad (21.18)$$

where $\delta = [\alpha \;\; \beta']'$ is a $(K+1) \times 1$ parameter vector, $y_i$ and $u_i$ are $T \times 1$ vectors
with $t$th entries $y_{it}$ and $u_{it}$, respectively, and $W_i$ is a $T \times (K+1)$ matrix with $t$th row
$w_{it}' = [1 \;\; x_{it}']$. Stacking all individuals yields

$$y = W\delta + u, \qquad (21.19)$$

where $y$ and $u$ are $NT \times 1$ vectors, for example $y = [y_1' \ldots y_N']'$, and $W$ is an
$NT \times (K+1)$ regressor matrix whose first column is a vector of ones. We assume
that $E[u|W] = 0$, so errors are strictly exogenous, and define $\Omega = E[uu'|W]$.

There are several possible least-squares estimators of this model, summarized in 
Table 21.5. 

First, pooled OLS is consistent and asymptotically normal. However, in a panel
setting it is unlikely that $\Omega = \sigma^2 I_{NT}$, so OLS is inefficient except in some special
cases such as when all regressors are time-invariant. More importantly, the usual OLS
variance estimate of $\sigma^2(W'W)^{-1}$ should not be used and a panel-robust estimate such
as that in (21.13) needs to be used.

Second, pooled feasible GLS (PFGLS) is consistent and fully efficient if $\Omega$ is
correctly specified and $\hat{\Omega}$ is consistent for $\Omega$. Some of the very large range of structures
on $u_{it}$ and hence $\Omega$ that have been proposed in the panel literature and incorporated
Table 21.5. Pooled Least-Squares Estimators and Their Asymptotic Variances


Estimator                              Formula^a                                          Variance Matrix^b

Pooled OLS: $\hat{\delta}_{POLS}$      $(W'W)^{-1}W'y$                                    $(W'W)^{-1}W'\hat{\Omega}W(W'W)^{-1}$

Pooled FGLS: $\hat{\delta}_{PFGLS}$    $(W'\hat{\Omega}^{-1}W)^{-1}W'\hat{\Omega}^{-1}y$  $(W'\hat{\Omega}^{-1}W)^{-1}$

Pooled WLS: $\hat{\delta}_{PWLS}$      $(W'\hat{\Sigma}^{-1}W)^{-1}W'\hat{\Sigma}^{-1}y$  $(W'\hat{\Sigma}^{-1}W)^{-1}W'\hat{\Sigma}^{-1}\hat{\Omega}\hat{\Sigma}^{-1}W(W'\hat{\Sigma}^{-1}W)^{-1}$


a The formulas are for the model $y = W\delta + u$ defined in (21.19) and error matrix $\Omega$.
b For computation of $\hat{\Omega}$ for the variance matrices of POLS and PWLS see the text; in those cases $\hat{\Omega}$
need not be consistent for $\Omega$. For pooled FGLS it is assumed that $\hat{\Omega}$ is consistent for $\Omega$.


into regression packages are given in Sections 21.5.2 and 21.5.3 for, respectively, short 
and long panels. 

Third, the pooled weighted LS (PWLS) estimator guards against misspecification
of $\Omega$. It posits a working matrix $\Sigma$ for the error variance matrix $\Omega$ but then
performs inference that is valid even if $\Sigma \neq \Omega$. Ordinary least squares is an example,
with $\Sigma = \sigma^2 I_{NT}$, but other choices of $\Sigma$ may improve efficiency.

Estimation of the variance matrix of the pooled OLS estimator requires an $\hat{\Omega}$ such
that $(NT)^{-1}W'\hat{\Omega}W$ consistently estimates $(NT)^{-1}W'\Omega W$.

For short panels this is possible by direct application of the results of Section 21.2.3.
Estimation of the variance matrix of the pooled WLS estimator requires an $\hat{\Omega}$ such that
$(NT)^{-1}W'\hat{\Sigma}^{-1}\hat{\Omega}\hat{\Sigma}^{-1}W$ consistently estimates $(NT)^{-1}W'\Sigma^{-1}\Omega\Sigma^{-1}W$. The
panel-robust estimate for OLS given in (21.13) can be adapted to pooled WLS by replacing
$W'\Sigma^{-1}\Omega\Sigma^{-1}W$, or equivalently $\sum_i W_i'\Sigma_i^{-1}E[u_iu_i'|W_i]\Sigma_i^{-1}W_i$ given independence
over $i$, by the quantity $\sum_i W_i'\hat{\Sigma}_i^{-1}\hat{u}_i\hat{u}_i'\hat{\Sigma}_i^{-1}W_i$, where $\hat{u}_i = y_i - W_i\hat{\delta}$. Alternatively,
a panel bootstrap can be used.


21.5.2. Error Variance Matrix for Short Panels 


In short panels there are few time periods but many individuals, usually people
or firms. It is assumed that errors are independent over individuals, so that
$\mathrm{Cov}[u_{it}, u_{js}] = 0$, $i \neq j$. In such cases it is convenient to revert to summation
notation. For example, the PFGLS estimator given in Table 21.5 becomes

$$\hat{\delta}_{PFGLS} = \left[\sum_{i=1}^{N} W_i'\hat{\Omega}_i^{-1}W_i\right]^{-1}\sum_{i=1}^{N} W_i'\hat{\Omega}_i^{-1}y_i, \qquad (21.20)$$

where $\hat{\Omega}_i$ is consistent for

$$\Omega_i = E[u_iu_i'|W_i], \qquad (21.21)$$

and $\Omega_i$ is nondiagonal as errors for a given individual are likely to be correlated over
time. Note that $\hat{\Omega}_i$ needs to come from estimation of a specified model for $\Omega_i$, and we
cannot use $\hat{\Omega}_i = \hat{u}_i\hat{u}_i'$ (see the related discussion after equation (5.88)).
Equicorrelated Errors


The most commonly used error structure is the random effects model presented in
Section 21.2.1. Then from (21.6) $\Omega_i$ has common diagonal entries $\sigma_\alpha^2 + \sigma_\varepsilon^2$ and common
off-diagonal entries $\sigma_\alpha^2$. Equivalently, the errors are equicorrelated, with $\Omega_i$ having
common diagonal entries $\sigma^2$ and common off-diagonal entries $\rho\sigma^2$. Implementation
of FGLS requires only estimation of $\sigma_\varepsilon^2$ and $\sigma_\alpha^2$, or of $\sigma^2$ and $\rho$ (see Sections 21.2.2
and 21.7).


ARMA Errors 


An alternative error structure is to assume an ARMA error model. For example,
an AR(1) error model specifies that $u_{it} = \rho u_{i,t-1} + \varepsilon_{it}$, where $\varepsilon_{it}$
are iid. Then $\mathrm{Cov}[u_{it}, u_{is}] = \rho^{|t-s|}\sigma^2$. In this case the covariance between
errors falls as the number of time periods between the errors increases. The RE model and an
AR(1) error model are compared in Section 21.5.4.

Baltagi and Li (1991) combine the two error models to consider a random effects 
model with AR(1) errors. This can be easily generalized to the AR(p) case, and meth- 
ods for moving average and ARMA errors (see Section 5.8.7) in random effects mod- 
els have also been developed more recently. A summary is given in Baltagi (2001, 
Chapter 5). 


Homoskedastic Errors with Unstructured Autocorrelation 


For FGLS estimation in short panels there is actually no need to impose as much
structure as that imposed by an RE model or an AR(1) error model, if the assumption
is made that the $T \times T$ matrix $\Omega_i$ is constant over $i$. Then there are “only” $T(T+1)/2$
covariance parameters to estimate. A consistent estimate of $\Omega_i$ is then $\hat\Omega_i$ with
$(t,s)$th entry $\hat\sigma_{ts} = N^{-1}\sum_{i=1}^{N}\hat u_{it}\hat u_{is}$. The preceding models
also assume homoskedasticity, but place additional structure on $\Omega_i$.
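
The unstructured estimate and the PFGLS formula (21.20) can be sketched in a few lines. This is a minimal illustration only, assuming a balanced panel stored as NumPy arrays; the function name `pooled_fgls_unstructured` and the array layout are choices made here, not part of the text.

```python
import numpy as np

def pooled_fgls_unstructured(y, W):
    """Pooled FGLS with an unstructured T x T error covariance.

    y : (N, T) array of outcomes (balanced panel, assumed).
    W : (N, T, K) array of regressors, including a constant column.
    Returns (delta_fgls, omega_hat).
    """
    N, T, K = W.shape
    # Step 1: pooled OLS to obtain residuals u_hat.
    delta_ols = np.linalg.lstsq(W.reshape(N * T, K), y.reshape(N * T), rcond=None)[0]
    u = y - W @ delta_ols                      # (N, T) residuals
    # Step 2: (t, s) entry is N^{-1} sum_i u_it u_is, common to all i.
    omega = u.T @ u / N
    omega_inv = np.linalg.inv(omega)
    # Step 3: the two sums in (21.20).
    A = np.einsum("itk,ts,isl->kl", W, omega_inv, W)   # sum_i W_i' Omega^{-1} W_i
    b = np.einsum("itk,ts,is->k", W, omega_inv, y)     # sum_i W_i' Omega^{-1} y_i
    return np.linalg.solve(A, b), omega
```

Inference after this step would still normally use panel-robust standard errors, since the assumed common covariance matrix may be wrong.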


Robust Inference 


All of the preceding specifications assume that error covariances are the same across 
individuals, which rules out heteroskedasticity. Provided the panel is short one can 
nonetheless use the preceding restrictive error variance matrix models as the basis 
for pooled WLS estimation, but then obtain robust standard errors as discussed af- 
ter Table 21.5. Alternatively, richer mixed models, presented in Chapter 22, can be 
estimated. 

The assumption of independence over i is maintained throughout Chapters 21-23, 
though it can be relaxed even for small T provided structure can be placed on the 
correlation. An example is an explicit model for spatial correlation for panel data on 
regions such as states or countries, with correlations declining as physical distance 
between individual observations increases. 
21.5.3. Error Variance Matrix for Long Panels


In long panels there are many time periods but relatively few individuals. Such data
can arise in microeconometric analysis if the individual observational unit is one of
only a few regions, such as a state or country, or firms, but these are observed over
enough time periods to base inference on the assumption that $T \to \infty$.

Correlation across time for a given individual can be introduced using an ARMA
model for the errors, with the parameters of the ARMA model permitted to differ
across individuals as now $N$ is fixed and $T \to \infty$. For example, consider an AR(1)
error with $u_{it} = \rho_i u_{i,t-1} + \varepsilon_{it}$, where $\varepsilon_{it} \sim [0, \sigma_i^2]$ is
heteroskedastic and $\rho_i$ also differs across individuals. Separate regressions of $y_{it}$ on
$w_{it}$ with AR(1) errors for each individual using $T$ time periods yield consistent estimates
$\hat\rho_i$ and $\hat\sigma_i^2$, since $T \to \infty$. These can then be used for feasible GLS
estimation of $\delta$ using all $NT$ observations. For details see Kmenta (1986). This model
permits both heteroskedasticity across individuals and correlation over time for a given
individual. Pesaran (2004) proposes a considerably richer model that is estimated by GLS.

For long panels it is possible to introduce correlation across individuals, so that
$\mathrm{Cov}[u_{it}, u_{jt}] \neq 0$ for $i \neq j$, since $N$ is fixed and asymptotic results rely on
$T \to \infty$. In particular, one can perform pooled GLS estimation as done earlier, with the
assumption of independence across individuals, but then calculate standard errors using the
method of Newey and West (1987b), mentioned briefly in Section 6.4.4, that permits arbitrary
cross-sectional dependence and serial dependence, provided the serial dependence dies
away sufficiently fast. For details see Arellano (2003, p. 19).

Time-series considerations for panel data are discussed in more detail in Section 
22.5 for models with lagged dependent variables as regressors. 


21.5.4. The Impact of Autocorrelated Errors 


Panel data regression models have errors that are usually autocorrelated over time 
for a given individual. If fixed effects are absent then pooled OLS regression gives 
consistent parameter estimates. However, the error correlation can lead to large bias 
in standard errors for pooled OLS if autocorrelation is ignored and to relatively small 
efficiency gains as the length of a panel is increased. 

The analysis is particularly simple for estimation of the mean of $y$ based on $T$
observations for one individual (so $N = 1$) with equicorrelation. Then $y_t = \beta + u_t$,
and the OLS estimator is the sample mean, so $\hat\beta = \bar y = T^{-1}\sum_t y_t$. The OLS
estimator has true variance $V[\hat\beta] = V[\bar y] = T^{-2}\sum_t\sum_s \mathrm{Cov}[u_t, u_s]$.
Assuming equicorrelation the double sum has $T$ variances equal to $\sigma^2$ and $T(T-1)$
covariances all equal to $\rho\sigma^2$. Hence $V[\bar y] = T^{-1}\sigma^2(1 + (T-1)\rho)$. Thus the
iid result that $V[\bar y] = T^{-1}\sigma^2$ needs to be modified by inflation by a multiple
$(1 + \rho(T-1))$. In particular $V[\bar y]$ approaches $\sigma^2$ as $\rho \to 1$.
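
The inflation formula is easy to check numerically. The short sketch below evaluates $V[\bar y] = T^{-1}\sigma^2(1 + (T-1)\rho)$ and, as a cross-check, the underlying double sum of covariances; the function names are illustrative and not from the text.

```python
def var_mean_equicorr(T, rho, sigma2=1.0):
    """Closed form: V[ybar] = sigma^2 (1 + (T - 1) rho) / T."""
    return sigma2 * (1.0 + (T - 1) * rho) / T

def var_mean_double_sum(T, rho, sigma2=1.0):
    """Direct evaluation of T^{-2} sum_t sum_s Cov[u_t, u_s] under equicorrelation."""
    total = 0.0
    for t in range(T):
        for s in range(T):
            total += sigma2 if t == s else rho * sigma2
    return total / T**2

# Print the variance for a grid of T and rho with sigma^2 normalized to one.
for T in (1, 2, 5, 10):
    row = [round(var_mean_equicorr(T, rho), 2) for rho in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0)]
    print(T, row)
```

With $\sigma^2 = 1$ the printed grid corresponds to the setting of Table 21.6.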

Table 21.6 presents the impact of correlation on the variance of $\bar y$ for different values
of $T$ and $\rho$, where for simplicity we normalize $\sigma^2 = 1$. The precision of estimation
falls considerably as $\rho$ increases, and the estimate of $V[\bar y]$ under the assumption of
independence given in the first column (assuming $\sigma^2$ is known for simplicity) can
greatly understate the true variance. Furthermore, for $\rho > 0$ the gain in precision due
to increase in the number of time periods is much smaller than with independent data
where a doubling of the number of time periods will halve estimator variance. For
example, if $\rho = 0.4$ then with five time periods the estimator variance is only 0.52
times that with one period, instead of the much lower multiple of 0.2 with independent
data. Moreover, a doubling from 5 to 10 time periods leads to only a small reduction
in estimator variance from 0.52 to 0.46.


Table 21.6. Variances of Pooled OLS Estimator with Equicorrelated Errors^a

  T    ρ = 0.0   ρ = 0.2   ρ = 0.4   ρ = 0.6   ρ = 0.8   ρ = 1.0
  1     1.00      1.00      1.00      1.00      1.00      1.00
  2     0.50      0.60      0.70      0.80      0.90      1.00
  5     0.20      0.36      0.52      0.68      0.84      1.00
 10     0.10      0.28      0.46      0.64      0.82      1.00

^a Given are the variances of the pooled OLS estimator as the correlation ρ of equicorrelated
errors increases, for an intercept-only model with error variance normalized to one, assuming
errors are correlated though homoskedastic.

This result holds more generally for balanced panel regression with equicorrelated
errors and regressors that are time-invariant, where the true variance of the OLS
estimator is $(1 + \rho(T-1))$ times that assuming independent errors (see Kloek, 1981).
In practice time-varying regressors are also included and clear analytical results are
more difficult to obtain. For regression with intercept and single time-varying regressor,
Scott and Holt (1982) show that the variance of the slope coefficient is inflated
by the multiple $(1 + \rho_x\rho(T-1))$, where $\rho_x$ can be viewed as an estimate of the
individual-specific autocorrelation in $x$. For panel data $\rho_x$ is often high so that there
is still considerable inflation. These results also apply to other forms of clustered data
and are presented in more detail in Section 24.5.2.

The preceding analysis assumes equicorrelated errors, a property of the RE model.
If instead errors are AR(1) there is greater benefit from increasing panel length.
Then $\mathrm{Cov}[u_t, u_s] = \rho^{|t-s|}\sigma^2$, so
$V[\bar y] = T^{-2}\sigma^2\left[T + 2\sum_{s=1}^{T-1}(T-s)\rho^s\right]$. For example,
if $\rho = 0.8$ then $V[\bar y] = 0.72\sigma^2$ for $T = 5$ and $0.54\sigma^2$ for $T = 10$, lower than the
corresponding values from Table 21.6 of $0.84\sigma^2$ and $0.82\sigma^2$ for equicorrelation with
$\rho = 0.8$, but still much higher than values of $0.20\sigma^2$ and $0.10\sigma^2$ for $\rho = 0.0$.
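
The AR(1) figures quoted above can be reproduced with a short calculation. The sketch below evaluates the closed form $V[\bar y] = T^{-2}\sigma^2[T + 2\sum_{s=1}^{T-1}(T-s)\rho^s]$ and checks it against the direct double sum of $\rho^{|t-s|}\sigma^2$; the function names are illustrative only.

```python
def var_mean_ar1(T, rho, sigma2=1.0):
    """Closed form for V[ybar] when Cov[u_t, u_s] = rho^|t-s| sigma^2."""
    s = sum((T - k) * rho**k for k in range(1, T))
    return sigma2 * (T + 2.0 * s) / T**2

def var_mean_ar1_double_sum(T, rho, sigma2=1.0):
    """Direct evaluation of T^{-2} sum_t sum_s rho^|t-s| sigma^2."""
    return sigma2 * sum(rho ** abs(t - s) for t in range(T) for s in range(T)) / T**2

for T in (5, 10):
    print(T, round(var_mean_ar1(T, 0.8), 2), round(var_mean_ar1(T, 0.0), 2))
```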

Microeconometricians gravitate to the RE model or equicorrelated error models for 
short panels as an outgrowth of the literature on clustered data presented in Chapter 24. 
For example, consider data on different siblings in a family for many families. Then 
it is natural to assume that correlations of unobservables across siblings in the same 
family are the same for different sibling pairs. For example, the correlation between
the first and second siblings equals that between the first and third siblings. Those using 
long panel data instead often have a time-series background and naturally assume that 
correlation declines over time, leading to models such as an AR(1) error. 

Determining which model of time-series correlation is more reasonable really de- 
pends on the data. Many short panels used in microeconomics applications yield 
pooled OLS residual autocorrelations that are qualitatively similar to those given in
Table 21.3. These are closer to an RE model than an AR(1) model, though an 
ARMA(1,1) model may do well. Better still may be an RE model with AR(1) error. 
In all cases error correlation leads to a loss of information and the usual OLS standard 
errors understate the true standard errors. For short panels one can base inference on 
panel robust standard errors (see Section 21.2.3) that do not require specifying a model 
for the error correlation. 


21.5.5. Hours and Wages Pooled GLS Example 


A variety of pooled GLS estimates and associated default and robust standard errors
of the model $y_{it} = \alpha + \beta x_{it} + u_{it}$ for the lnhrs on lnwg regression are given in
Table 21.7. All assume the error $u_{it}$ is independent over $i$ and identically distributed over
$i$, and then have different assumptions on correlation in $u_{it}$ over $t$.

The first column of Table 21.7, for the pooled OLS estimator, repeats the first col- 
umn of Table 21.2. 

Pooled GLS estimates assuming equicorrelated errors are given in the second col- 
umn of Table 21.7. These coincide with the RE-GLS column in Table 21.2, since the 
random effects model implies equicorrelated errors (see (21.6)). 

Pooled GLS estimates assuming AR(1) errors, so that $u_{it} = \rho u_{i,t-1} + \varepsilon_{it}$ where
$\varepsilon_{it}$ is iid, are given in the third column of Table 21.7. The slope coefficient estimate is
close to the pooled OLS estimate.

Pooled GLS estimates with no structure placed on error correlation aside from
homoskedasticity, so that $\mathrm{Cov}[u_{it}, u_{is}] = \sigma_{ts}$, are given in the fourth column of
Table 21.7. Then $\sigma_{ts}$ is consistently estimated given small $T$ by
$\hat\sigma_{ts} = N^{-1}\sum_{i=1}^{N}\hat u_{it}\hat u_{is}$ for all $t$ and $s$. These are again
close to the pooled OLS estimate.

It is clear from Table 21.7 that panel-robust standard errors should be used rather
than the default standard errors, which here assume homoskedasticity and a correctly
specified model for serial correlation.


Table 21.7. Hours and Wages: Pooled OLS and GLS Estimates^a

 Estimator            POLS      PFGLS     PFGLS     PFGLS
 Error correlation    None      Equi      AR1       General
 α                    7.442     7.346     7.440     7.426
 β                     .083      .120      .084      .091
 Robust se            (.029)    (.052)    (.037)    (.050)
 Boot se              [.032]    [.060]    [.050]    [-]
 Default se           {.009}    {.014}    {.012}    {.014}

^a Pooled OLS and GLS linear panel regression of lnhrs on lnwg for a short panel assuming
independence and identical distribution over i and no fixed effects. Pooled
GLS estimators assume equicorrelated or random effects errors (equi), AR(1) errors
(AR1), or no structure on the correlations (general). Standard errors for the slope
coefficients are panel robust in parentheses, panel bootstrap in square brackets, and
usual default estimates that assume iid errors in curly braces.
21.6. Fixed Effects Model


The fixed effects model specifies

$$y_{it} = \alpha_i + x_{it}'\beta + \varepsilon_{it}, \tag{21.22}$$

where the individual-specific effects $\alpha_1, \ldots, \alpha_N$ measure unobserved heterogeneity
that is possibly correlated with the regressors, $x_{it}$ and $\beta$ are $K \times 1$ vectors, and to
begin with the errors $\varepsilon_{it}$ are iid $[0, \sigma_\varepsilon^2]$.

The challenge for estimation is the presence of the $N$ individual-specific effects
that increase in number as $N \to \infty$. For practical purposes we are most interested
in the $K$ slope parameters $\beta$, which give the marginal effect of change in regressors
since $\partial E[y_{it}]/\partial x_{it} = \beta$. The $N$ parameters $\alpha_1, \ldots, \alpha_N$ are nuisance
parameters or incidental parameters that are not of intrinsic interest. Nevertheless, their
presence potentially prevents estimation of the parameters that are of interest.

Remarkably, for the linear model there are several ways to consistently estimate
$\beta$ despite the presence of these nuisance parameters. These include (1) OLS in the
within model (21.8); (2) direct OLS estimation of the model (21.2) with indicator
variables for each of the $N$ fixed effects; (3) GLS in the within model (21.8); (4) ML
estimation conditional on the individual means $\bar y_i$, $i = 1, \ldots, N$; and (5) OLS in the
first-differences model (21.9).

The first two methods always lead to the same estimator for $\beta$. So too does the
third if additionally the $\varepsilon_{it}$ in (21.22) are iid and the fourth if
$\varepsilon_{it} \sim \mathcal{N}[0, \sigma_\varepsilon^2]$. The last estimator differs from the others
for $T > 2$. Such equivalences generally do not hold in nonlinear models, which are considered
in Chapter 23.

The essential results for the within estimator are given in the next section. The
first-differences estimator, presented in Section 21.6.2, is extensively used in Chapter 22
when regressors are no longer strongly exogenous. The other estimators are presented
in the remainder of Section 21.6, which some readers may wish to skip.


21.6.1. Within or Fixed Effects Estimator 


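The within model is obtained by subtraction of the time-averaged model
$\bar y_i = \alpha_i + \bar x_i'\beta + \bar\varepsilon_i$ from the original model. Then

$$y_{it} - \bar y_i = (x_{it} - \bar x_i)'\beta + (\varepsilon_{it} - \bar\varepsilon_i), \tag{21.23}$$

so the fixed effect $\alpha_i$ is eliminated, along with time-invariant regressors since
$x_{it} - \bar x_i = 0$ if $x_{it} = x_i$ for all $t$.

Using OLS estimation yields the within estimator or fixed effects estimator $\hat\beta_W$,
where

$$\hat\beta_W = \left[\sum_{i=1}^{N}\sum_{t=1}^{T}(x_{it} - \bar x_i)(x_{it} - \bar x_i)'\right]^{-1} \sum_{i=1}^{N}\sum_{t=1}^{T}(x_{it} - \bar x_i)(y_{it} - \bar y_i). \tag{21.24}$$

The individual fixed effects $\alpha_i$ can then be estimated by

$$\hat\alpha_i = \bar y_i - \bar x_i'\hat\beta_W, \quad i = 1, \ldots, N. \tag{21.25}$$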
The estimate $\hat\alpha_i$ is unbiased for $\alpha_i$, and it is consistent provided $T \to \infty$
since $\hat\alpha_i$ averages $T$ observations. In short panels the estimates $\hat\alpha_i$ are
inconsistent, but $\hat\beta_W$ is nonetheless consistent for $\beta$. The $\alpha_i$ are viewed as nuisance
parameters or ancillary parameters that fortunately do not need to be consistently estimated to
obtain consistent estimates of the more important slope parameters $\beta$. This remarkable result
need not carry over to more complicated fixed effects models such as nonlinear models.
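
As an illustration, the within estimator (21.24) and the implied fixed effects (21.25) can be computed by demeaning within each individual and running OLS. This is a minimal sketch assuming a balanced panel held in NumPy arrays; the function name and array layout are assumptions made here.

```python
import numpy as np

def within_estimator(y, X):
    """Within (fixed effects) estimator for a balanced panel.

    y : (N, T) outcomes.
    X : (N, T, K) time-varying regressors, with no constant column.
    Returns (beta_w, alpha_hat) as in (21.24) and (21.25).
    """
    K = X.shape[2]
    y_dm = y - y.mean(axis=1, keepdims=True)        # y_it - ybar_i
    X_dm = X - X.mean(axis=1, keepdims=True)        # x_it - xbar_i
    beta_w = np.linalg.lstsq(X_dm.reshape(-1, K), y_dm.reshape(-1), rcond=None)[0]
    alpha_hat = y.mean(axis=1) - X.mean(axis=1) @ beta_w
    return beta_w, alpha_hat

# Simulated example in which the fixed effect is correlated with the regressor.
rng = np.random.default_rng(0)
N, T = 400, 5
alpha = rng.normal(size=N)
x = alpha[:, None] + rng.normal(size=(N, T))
y = alpha[:, None] + 2.0 * x + rng.normal(size=(N, T))
beta_w, alpha_hat = within_estimator(y, x[:, :, None])
print(beta_w)
```

In this simulated design pooled OLS of y on x would be biased upward because the effect is correlated with x, whereas the within estimate stays close to the true slope of 2.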


Consistency of the Within Estimator 


The within estimator of $\beta$ is consistent if
$\mathrm{plim}\,(NT)^{-1}\sum_i\sum_t (x_{it} - \bar x_i)(\varepsilon_{it} - \bar\varepsilon_i) = 0$.
This should happen if either $N \to \infty$ or $T \to \infty$ and

$$E[\varepsilon_{it} - \bar\varepsilon_i \mid x_{it} - \bar x_i] = 0. \tag{21.26}$$

Owing to the presence of the averages $\bar x_i = T^{-1}\sum_t x_{it}$ and $\bar\varepsilon_i$ this
condition is stronger than $E[\varepsilon_{it}|x_{it}] = 0$. A sufficient condition for (21.26) is the
strong exogeneity condition that $E[\varepsilon_{it}|x_{i1}, \ldots, x_{iT}] = 0$. This precludes
within estimation with lagged endogenous variables as regressors (see Section 22.5).


Asymptotic Distribution of the Within Estimator 


The distribution of $\hat\beta_W$ appears potentially complicated because the error
$(\varepsilon_{it} - \bar\varepsilon_i)$ in the within model (21.8) is correlated over $t$ for given $i$.
It is shown in the following that the usual OLS results nonetheless apply. Under the strong
assumption that $\varepsilon_{it}$ is iid,

$$V[\hat\beta_W] = \sigma_\varepsilon^2\left[\sum_{i=1}^{N}\sum_{t=1}^{T}\tilde x_{it}\tilde x_{it}'\right]^{-1}, \tag{21.27}$$

where $\tilde x_{it} = x_{it} - \bar x_i$. A consistent and unbiased estimate of $\sigma_\varepsilon^2$ is
$\hat\sigma_\varepsilon^2 = [N(T-1) - K]^{-1}\sum_i\sum_t \hat{\tilde\varepsilon}_{it}^2$, where
$\hat{\tilde\varepsilon}_{it}$ is the residual from the within regression and the degrees of freedom
equal the sample size $NT$ less the number of model parameters $K$ and the $N$ individual effects.
Note that if the regression (21.23) is estimated using a standard least-squares package then we
need to inflate the reported variances by $[N(T-1) - K]^{-1}[NT - K]$.

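For short panels (21.13) yields the robust estimate of the asymptotic variance

$$\hat V[\hat\beta_W] = \left[\sum_{i=1}^{N}\sum_{t=1}^{T}\tilde x_{it}\tilde x_{it}'\right]^{-1} \sum_{i=1}^{N}\sum_{t=1}^{T}\sum_{s=1}^{T}\tilde x_{it}\tilde x_{is}'\,\hat{\tilde\varepsilon}_{it}\hat{\tilde\varepsilon}_{is} \left[\sum_{i=1}^{N}\sum_{t=1}^{T}\tilde x_{it}\tilde x_{it}'\right]^{-1}, \tag{21.28}$$

where $\hat{\tilde\varepsilon}_{it} = \hat\varepsilon_{it} - \bar{\hat\varepsilon}_i$. This preferred estimate
permits arbitrary autocorrelations for the $\varepsilon_{it}$ and arbitrary heteroskedasticity.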
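
A minimal sketch of both variance estimates follows, assuming a balanced panel in NumPy arrays. The default estimate applies the degrees-of-freedom correction $N(T-1)-K$ described for (21.27); the robust estimate uses the fact that the triple sum in (21.28) equals $\sum_i g_i g_i'$ with $g_i = \sum_t \tilde x_{it}\hat{\tilde\varepsilon}_{it}$. The function name is illustrative.

```python
import numpy as np

def within_variances(y, X):
    """Within estimator with default (21.27) and panel-robust (21.28) variances.

    y : (N, T) outcomes; X : (N, T, K) regressors without a constant.
    """
    N, T, K = X.shape
    Xd = X - X.mean(axis=1, keepdims=True)
    yd = y - y.mean(axis=1, keepdims=True)
    A = np.einsum("itk,itl->kl", Xd, Xd)             # sum_i sum_t x~ x~'
    beta = np.linalg.solve(A, np.einsum("itk,it->k", Xd, yd))
    e = yd - Xd @ beta                               # within residuals, (N, T)
    A_inv = np.linalg.inv(A)
    s2 = (e**2).sum() / (N * (T - 1) - K)            # df: NT - N - K
    V_default = s2 * A_inv
    g = np.einsum("itk,it->ik", Xd, e)               # g_i = sum_t x~_it e~_it
    V_robust = A_inv @ (g.T @ g) @ A_inv             # sandwich, clustered on i
    return beta, V_default, V_robust
```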
Derivation of the Variance of the Within Estimator


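We now derive the estimates of the variance of the within estimator given in (21.27)
and (21.28), using matrix algebra. We begin with the model for the $it$th observation

$$y_{it} = \alpha_i + x_{it}'\beta + \varepsilon_{it},$$

where $x_{it}$ and $\beta$ are $K \times 1$ vectors. For the $i$th individual, stack all $T$ observations, so

$$\begin{bmatrix} y_{i1} \\ \vdots \\ y_{iT} \end{bmatrix} = \begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}\alpha_i + \begin{bmatrix} x_{i1}' \\ \vdots \\ x_{iT}' \end{bmatrix}\beta + \begin{bmatrix} \varepsilon_{i1} \\ \vdots \\ \varepsilon_{iT} \end{bmatrix}, \quad i = 1, \ldots, N,$$

or

$$y_i = e\alpha_i + X_i\beta + \varepsilon_i, \quad i = 1, \ldots, N, \tag{21.29}$$

where $e = (1, 1, \ldots, 1)'$ is a $T \times 1$ vector of ones, $X_i$ is a $T \times K$ matrix, and $y_i$ and
$\varepsilon_i$ are $T \times 1$ vectors.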
To transform model (21.29) to the within model, which subtracts the individual-specific
mean, introduce the $T \times T$ matrix

$$Q = I_T - T^{-1}ee'. \tag{21.30}$$

Premultiplication by the matrix $Q$ creates deviations from the mean, since

$$QW_i = W_i - e\bar w_i', \tag{21.31}$$

where $W_i$ is a $T \times m$ matrix with $t$th row $w_{it}'$, and $\bar w_i = T^{-1}\sum_t w_{it}$ is an
$m \times 1$ vector of averages. The result (21.31) is obtained using $e'W_i = T\bar w_i'$. Note also that
$QQ' = Q$, using $e'e = T$ and $Qe = 0$, so $Q$ is idempotent.
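
These properties of $Q$ are easy to confirm numerically. The sketch below builds $Q$ for an arbitrary small $T$ and an arbitrary matrix $W_i$ (both chosen here purely for illustration) and checks the demeaning, idempotency, and rank-deficiency properties.

```python
import numpy as np

T = 4
e = np.ones((T, 1))                        # T x 1 vector of ones
Q = np.eye(T) - e @ e.T / T                # Q = I_T - T^{-1} e e'

W_i = np.arange(12, dtype=float).reshape(T, 3)   # arbitrary T x m example matrix
w_bar = W_i.mean(axis=0)                          # m-vector of column averages

print(np.allclose(Q @ W_i, W_i - e @ w_bar[None, :]))  # Q W_i = W_i - e wbar_i'
print(np.allclose(Q @ Q.T, Q))                          # Q is idempotent
print(np.allclose(Q @ e, 0.0))                          # Q e = 0
print(np.linalg.matrix_rank(Q))                         # rank T - 1, so Q is singular
```

The rank of $T - 1$ is the reason a generalized inverse is needed when GLS is applied to the within model.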

Premultiplying the fixed effects model (21.29) for the $i$th individual by $Q$ yields

$$Qy_i = QX_i\beta + Q\varepsilon_i, \quad i = 1, \ldots, N, \tag{21.32}$$

using $Qe = 0$. This is the within model (21.23), since equivalently
$y_i - e\bar y_i = (X_i - e\bar x_i')\beta + (\varepsilon_i - e\bar\varepsilon_i)$ using (21.31). Thus
premultiplication by $Q$ yields the within model. An OLS estimation of (21.32) yields $\hat\beta_W$
with variance matrix, assuming independence over $i$, equal to

$$V[\hat\beta_W] = \left[\sum_{i=1}^{N} X_i'Q'QX_i\right]^{-1} \sum_{i=1}^{N} X_i'Q'\,V[Q\varepsilon_i|X_i]\,QX_i \left[\sum_{i=1}^{N} X_i'Q'QX_i\right]^{-1}. \tag{21.33}$$

Begin with the strong assumption that $\varepsilon_{it}$ are iid $[0, \sigma_\varepsilon^2]$, so that
$\varepsilon_i$ are iid $[0, \sigma_\varepsilon^2 I]$. The $T \times 1$ error $Q\varepsilon_i$ is then independent
over $i$ with mean zero and variance
$V[Q\varepsilon_i] = QV[\varepsilon_i]Q' = \sigma_\varepsilon^2 QQ' = \sigma_\varepsilon^2 Q$. Then

$$\sum_{i=1}^{N} X_i'Q'\,V[Q\varepsilon_i|X_i]\,QX_i = \sum_{i=1}^{N} X_i'Q'\sigma_\varepsilon^2 QQX_i = \sigma_\varepsilon^2\sum_{i=1}^{N} X_i'Q'QX_i,$$

so that (21.33) simplifies to the estimate given in (21.27), using

$$(QX_i)'(QX_i) = \sum_{t=1}^{T}(x_{it} - \bar x_i)(x_{it} - \bar x_i)'.$$

At the time of writing many packages use (21.27) but alternative estimators may be
better. In particular, the assumption of serially uncorrelated error $\varepsilon_{it}$ is easily relaxed.
If $\varepsilon_i$ are iid $[0, \Sigma_i]$ we use the more general form of the variance matrix (21.33) with
$\mathrm{Cov}[Q\varepsilon_i, Q\varepsilon_j] = 0$, for $i \neq j$, and $V[Q\varepsilon_i]$ replaced by
$(Q\hat\varepsilon_i)(Q\hat\varepsilon_i)'$, where $\hat\varepsilon_i = y_i - X_i\hat\beta_W$. This yields
the estimate given in (21.28).

From the derivation it should be clear that $\hat\beta_W$ is also consistent in the random
effects model, though as shown in Section 21.7 it is less efficient than the random
effects estimator if the random effects model is appropriate.


GLS Estimation of the Within Model 


The within model (21.32) can also be estimated by feasible GLS.

If in fact $\varepsilon_{it}$ are iid $[0, \sigma_\varepsilon^2]$, however, then there are no gains to doing
GLS. To see this, note that then $Q\varepsilon_i$ is independent of $Q\varepsilon_j$, $i \neq j$, with
$V[Q\varepsilon_i] = \sigma_\varepsilon^2 Q$, so the GLS estimator is

$$\hat\beta_{W,GLS} = \left[\sum_{i=1}^{N} X_i'Q'Q^{-}QX_i\right]^{-1} \sum_{i=1}^{N} X_i'Q'Q^{-}Qy_i,$$

where the generalized inverse $Q^{-}$ is used as $Q$ is not of full rank. However,
$Q'Q^{-}Q = Q'Q$, since $Q'Q^{-}Q = Q$ for a generalized inverse of the symmetric matrix $Q$, and
$Q = Q'Q$ as $Q$ here is idempotent. Replacing $Q'Q^{-}Q$ by $Q'Q$ in the formula for
$\hat\beta_{W,GLS}$ yields the OLS estimator in (21.32).

There can be gains to GLS if other models for $\varepsilon_{it}$ are assumed. The approach is
essentially the same as that in Section 21.5.2 for pooled GLS without fixed effects,
except that first the fixed effect must be eliminated. This leads to error $Q\varepsilon_i$ that is less
than full rank, so we first drop one time period and apply pooled GLS to only $(T-1)$
time periods. It is easier, and often not much less efficient, to instead just use the usual
within FE estimator and then obtain panel-robust standard errors using (21.28).

MaCurdy (1982b) gives a Box-Jenkins-type analysis for identification and estimation
of ARMA processes for $\varepsilon_{it}$ in a fixed effects model for a short panel. For short
panels it is not necessary to assume an ARMA process for $\varepsilon_{it}$ or even stationarity,
since for $N \to \infty$ we can always consistently estimate $\mathrm{Cov}[u_{it}, u_{is}]$ by
$N^{-1}\sum_{i=1}^{N}\hat u_{it}\hat u_{is}$. Nonetheless, there may be interest in determining the
ARMA process for the errors.


21.6.2. First-Differences Estimator 


The within model is obtained by subtraction of the time-averaged model
$\bar y_i = \alpha_i + \bar x_i'\beta + \bar\varepsilon_i$ from the original model. Alternatively, one can
subtract the model lagged one period, $y_{i,t-1} = \alpha_i + x_{i,t-1}'\beta + \varepsilon_{i,t-1}$. Then

$$(y_{it} - y_{i,t-1}) = (x_{it} - x_{i,t-1})'\beta + (\varepsilon_{it} - \varepsilon_{i,t-1}), \quad t = 2, \ldots, T, \tag{21.34}$$


so the fixed effect $\alpha_i$ is eliminated. An OLS estimation yields the first-differences
estimator

$$\hat\beta_{FD} = \left[\sum_{i=1}^{N}\sum_{t=2}^{T}(x_{it} - x_{i,t-1})(x_{it} - x_{i,t-1})'\right]^{-1} \sum_{i=1}^{N}\sum_{t=2}^{T}(x_{it} - x_{i,t-1})(y_{it} - y_{i,t-1}). \tag{21.35}$$

Note that there are only $N(T-1)$ observations in this regression. An easy error to make
in implementation is to stack all $NT$ observations and then subtract the first lag. Then
only the $(1, 1)$ observation is dropped, whereas all $N$ first-period observations $(i, 1)$,
$i = 1, \ldots, N$, must be dropped after differencing.
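
The implementation pitfall can be made concrete. The sketch below differences stacked data and then discards every difference that straddles two individuals; it assumes the data are sorted by individual and then by time, and the helper name `first_difference` is hypothetical.

```python
import numpy as np

def first_difference(ids, v):
    """First-difference stacked panel data without mixing individuals.

    ids : (NT,) individual identifiers, sorted by individual then time.
    v   : (NT,) or (NT, K) stacked variable(s).
    Returns the N(T-1) within-individual differences.
    """
    ids = np.asarray(ids)
    v = np.asarray(v, dtype=float)
    d = v[1:] - v[:-1]               # naive lag subtraction: NT - 1 rows
    keep = ids[1:] == ids[:-1]       # False where the lag belongs to another individual
    return d[keep]

# Example with N = 3 individuals and T = 4 periods.
ids = np.repeat(np.arange(3), 4)
v = np.arange(12.0) ** 2
naive = v[1:] - v[:-1]
correct = first_difference(ids, v)
print(len(naive), len(correct))
```

The naive differencing keeps $NT - 1$ rows, two of which mix observations from different individuals, while the corrected version keeps $N(T-1)$ rows.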


Consistency of the First-Differences Estimator 


Consistency of the first-differences estimator requires that
$E[\varepsilon_{it} - \varepsilon_{i,t-1} \mid x_{it} - x_{i,t-1}] = 0$.
This is a stronger condition than $E[\varepsilon_{it}|x_{it}] = 0$ but a weaker condition than the strong
exogeneity condition needed for consistency of the within estimator.


Asymptotic Distribution of the First-Differences Estimator 


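Statistical inference requires adjusting the usual OLS standard errors to account for the
correlation over time in the error term $\varepsilon_{it} - \varepsilon_{i,t-1}$. To obtain the asymptotic
variance of $\hat\beta_{FD}$, stack the model for the $i$th individual as

$$\Delta y_i = \Delta X_i\beta + \Delta\varepsilon_i,$$

where $\Delta y_i$ is a $(T-1) \times 1$ vector with entries $(y_{i2} - y_{i1}), \ldots, (y_{iT} - y_{i,T-1})$, and
$\Delta X_i$ is a $(T-1) \times K$ matrix with rows $(x_{i2} - x_{i1})', \ldots, (x_{iT} - x_{i,T-1})'$. Then

$$\hat\beta_{FD} = \left[\sum_{i=1}^{N}(\Delta X_i)'(\Delta X_i)\right]^{-1} \sum_{i=1}^{N}(\Delta X_i)'(\Delta y_i) \tag{21.36}$$

has variance matrix, assuming independence over $i$, of

$$V[\hat\beta_{FD}] = \left[\sum_{i=1}^{N}\Delta X_i'\Delta X_i\right]^{-1} \left[\sum_{i=1}^{N}\Delta X_i'\,V[\Delta\varepsilon_i|\Delta X_i]\,\Delta X_i\right] \left[\sum_{i=1}^{N}\Delta X_i'\Delta X_i\right]^{-1}. \tag{21.37}$$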


The simplest assumption is that $\varepsilon_{it}$ are iid $[0, \sigma_\varepsilon^2]$. Then the error
$(\varepsilon_{it} - \varepsilon_{i,t-1})$ is now an MA(1) error, with variance $2\sigma_\varepsilon^2$ and
one-period apart autocovariance $-\sigma_\varepsilon^2$ for individual $i$. It follows that
$V[\Delta\varepsilon_i]$ equals $\sigma_\varepsilon^2$ times a $(T-1) \times (T-1)$ matrix with entries
of 2 on the diagonal, entries of $-1$ on the immediate off-diagonals, and 0s elsewhere.
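
The structure of $V[\Delta\varepsilon_i]$ under iid errors can be verified by writing differencing as a matrix operation. In the sketch below $D$ is the $(T-1) \times T$ first-difference matrix (a construction introduced here for illustration), so that $\Delta\varepsilon_i = D\varepsilon_i$ and $V[\Delta\varepsilon_i] = \sigma_\varepsilon^2 DD'$. Because adjacent differences share one error term with opposite signs, the immediate off-diagonal entries come out as $-1$.

```python
import numpy as np

T = 5
I = np.eye(T)
D = I[1:] - I[:-1]        # row t maps eps to eps_{t+1} - eps_t; shape (T-1, T)
V = D @ D.T               # V[Delta eps_i] / sigma_eps^2 under iid errors

print(V)
```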

A more realistic assumption is that $\varepsilon_{it}$ is correlated over time for given $i$, so
that $\mathrm{Cov}[\varepsilon_{it}, \varepsilon_{is}] \neq 0$ for $t \neq s$, but is still independent over $i$.
From (21.13), for short panels an estimator that is robust to general forms of autocorrelation and
heteroskedasticity is (21.37) with $V[\Delta\varepsilon_i]$ replaced by
$(\Delta\hat\varepsilon_i)(\Delta\hat\varepsilon_i)'$. One should never
use the usual OLS standard errors from OLS regression of the first-differences model
(21.34), as these are only correct in the unlikely event that $\varepsilon_{it}$ is a random walk, so that
$(\varepsilon_{it} - \varepsilon_{i,t-1})$ are iid.

For $T = 2$ the first-differences and within estimators are equal since
$\bar y = (y_1 + y_2)/2$ so $(y_1 - \bar y) = (y_1 - y_2)/2$ and $(y_2 - \bar y) = -(y_1 - y_2)/2$, and
similarly for $x$. For $T > 2$ the two estimators differ. Under the simplest assumption that
$\varepsilon_{it}$ are iid, it can be shown that the GLS estimator of the first-difference model (21.34)
equals the within estimator. The estimator $\hat\beta_{FD}$ instead estimates (21.34) by OLS and is
less efficient than $\hat\beta_W$. For this reason the first-difference estimator is not mentioned much
in introductory courses. However, it is used extensively once lagged dependent variables
are introduced (see Chapter 22). Then the within estimator is inconsistent. The
first-differences estimator is also inconsistent, but relies on weaker exogeneity assumptions
that permit consistent IV estimation.


21.6.3. Conditional ML Estimator 


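The conditional MLE maximizes the joint likelihood of $y_{11}, \ldots, y_{NT}$ conditional on
the individual averages $\bar y_1, \ldots, \bar y_N$. This method has the attraction that, for the linear
panel model under normality, the fixed effects $\alpha_i$ are eliminated, so maximization is
with respect to $\beta$ alone.

Assume that $y_{it}$ conditional on regressors $x_{it}$ and parameters $\alpha_i$, $\beta$, and $\sigma^2$ are iid
with normal distribution $\mathcal{N}[\alpha_i + x_{it}'\beta, \sigma^2]$. Then the conditional likelihood function
is

$$\begin{aligned}
L_{COND}(\beta, \sigma^2, \alpha) &= \prod_{i=1}^{N} f(y_{i1}, \ldots, y_{iT} \mid \bar y_i) \\
&= \prod_{i=1}^{N} \frac{f(y_{i1}, \ldots, y_{iT}, \bar y_i)}{f(\bar y_i)} \\
&= \prod_{i=1}^{N} \frac{(2\pi\sigma^2)^{-T/2}}{(2\pi\sigma^2/T)^{-1/2}} \exp\left\{\sum_{t=1}^{T}\left[-(y_{it} - x_{it}'\beta)^2 + (\bar y_i - \bar x_i'\beta)^2\right]/2\sigma^2\right\}.
\end{aligned} \tag{21.38}$$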

The first equality defines the conditional likelihood assuming independence over $i$.
The second equality always holds since, suppressing subscript $i$,
$f(y_1, \ldots, y_T|\bar y) = f(y_1, \ldots, y_T, \bar y)/f(\bar y)$ and
$f(y_1, \ldots, y_T, \bar y) = f(y_1, \ldots, y_T)$ as knowledge of $\bar y = T^{-1}\sum_t y_t$ adds
nothing given knowledge of $y_1, \ldots, y_T$. The third equality under normality comes after
considerable algebra that is left as an exercise.
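
The main steps of that algebra can be sketched as follows, under the stated normality assumption. The shorthand $a_{it} = y_{it} - x_{it}'\beta$ and $\bar a_i = \bar y_i - \bar x_i'\beta$ is introduced here for brevity and is not notation from the text. The joint density of $y_{i1}, \ldots, y_{iT}$ and the density of $\bar y_i \sim \mathcal{N}[\alpha_i + \bar x_i'\beta, \sigma^2/T]$ are divided, and the terms involving $\alpha_i$ cancel in the exponent.

```latex
\begin{aligned}
f(y_{i1},\ldots,y_{iT}) &= (2\pi\sigma^2)^{-T/2}
   \exp\Big\{-\sum_{t=1}^{T}(a_{it}-\alpha_i)^2/2\sigma^2\Big\}, \\
f(\bar y_i) &= (2\pi\sigma^2/T)^{-1/2}
   \exp\big\{-T(\bar a_i-\alpha_i)^2/2\sigma^2\big\}, \\
\sum_{t=1}^{T}(a_{it}-\alpha_i)^2 - T(\bar a_i-\alpha_i)^2
   &= \sum_{t=1}^{T} a_{it}^2 - T\bar a_i^2
    = \sum_{t=1}^{T}(a_{it}-\bar a_i)^2 .
\end{aligned}
```

The last line does not involve $\alpha_i$, and $\sum_t a_{it}^2 - T\bar a_i^2 = \sum_t [a_{it}^2 - \bar a_i^2]$ is the sum that appears, with reversed sign, in the exponent of the likelihood ratio.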

The key result is that the fixed effects $\alpha$ do not appear in the final equality in (21.38),
so $L_{COND}(\beta, \sigma^2, \alpha)$ is in fact $L_{COND}(\beta, \sigma^2)$, and we need to maximize the
conditional log-likelihood function (21.38) with respect to $\beta$ and $\sigma^2$ only. The resulting
conditional ML estimator $\hat\beta_{CML}$ solves the first-order conditions

$$\sum_{t=1}^{T}\sum_{i=1}^{N}\left[(y_{it} - x_{it}'\beta)x_{it} - (\bar y_i - \bar x_i'\beta)\bar x_i\right] = 0,$$

or equivalently

$$\sum_{t=1}^{T}\sum_{i=1}^{N}\left[(y_{it} - \bar y_i) - (x_{it} - \bar x_i)'\beta\right](x_{it} - \bar x_i) = 0.$$

However, these are just the first-order conditions from OLS regression of $(y_{it} - \bar y_i)$ on
$(x_{it} - \bar x_i)$.

The conditional MLE $\hat\beta_{CML}$ therefore equals the within estimator $\hat\beta_W$.

Intuitively, the method yields a consistent estimator because conditioning on $\bar y_i$
in (21.38) eliminated the fixed effects. More formally, $\bar y_i$ is a sufficient statistic for
$\alpha_i$ and conditioning on the sufficient statistic enables consistent estimation of $\beta$ (see
Section 23.2.2).


21.6.4. Least-Squares Dummy Variable Estimator 


Consider the original fixed effects model (21.22) before any differencing. An OLS
analysis can be applied directly to the model, simultaneously estimating $\alpha$ and $\beta$.

In principle no special software is needed. One simply estimates the OLS regression
of $y_{it}$ on $x_{it}$ and a set of $N$ indicator variables $d_{1,it}, \ldots, d_{N,it}$, where $d_{j,it}$ equals
one if $j = i$ and equals zero otherwise. However, as $N$ gets large there are too many
regressors to permit inversion of the $(N + K) \times (N + K)$ regressor matrix. Some matrix
algebra, however, reduces the problem to inversion of a $K \times K$ matrix.

The resulting estimator of $\beta$ turns out to equal the within estimator. This is a special
case of the so-called Frisch-Waugh Theorem for a subset regression. If dummy
variables are partialled out by regression of all the variables on the dummies, and if
the residuals from these regressions are used in a second stage regression, then we get
the same estimates as in the full regression. But these residuals here are simply deviations
from their respective means, i.e. the within regression. For completeness we now
present the relevant matrix algebra.
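
Before turning to the algebra, the equivalence can be checked numerically on simulated data. The sketch below, with arbitrary dimensions and parameter values chosen for illustration, runs the full dummy-variable regression and the within regression and compares the results.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, K = 30, 4, 2
X = rng.normal(size=(N, T, K))
alpha = rng.normal(size=N)
y = alpha[:, None] + X @ np.array([1.0, -0.5]) + rng.normal(size=(N, T))

# LSDV: regress stacked y on N individual dummies (I_N kron e) and X.
D = np.kron(np.eye(N), np.ones((T, 1)))              # NT x N dummy matrix
Z = np.hstack([D, X.reshape(N * T, K)])
coef = np.linalg.lstsq(Z, y.reshape(-1), rcond=None)[0]
alpha_lsdv, beta_lsdv = coef[:N], coef[N:]

# Within: regress demeaned y on demeaned X, then back out the effects.
Xd = (X - X.mean(axis=1, keepdims=True)).reshape(-1, K)
yd = (y - y.mean(axis=1, keepdims=True)).reshape(-1)
beta_w = np.linalg.lstsq(Xd, yd, rcond=None)[0]
alpha_fe = y.mean(axis=1) - X.mean(axis=1) @ beta_w

print(np.allclose(beta_lsdv, beta_w), np.allclose(alpha_lsdv, alpha_fe))
```

The direct regression here inverts an $(N+K)$-dimensional system, which is feasible only because $N$ is small in this example; the within route needs only a $K \times K$ inversion.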

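Stack the $T \times 1$ vectors in (21.29) over all $N$ individuals to yield the fixed effects
dummy variable model

$$\begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} e & 0 & \cdots & 0 \\ 0 & e & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & e \end{bmatrix} \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_N \end{bmatrix} + \begin{bmatrix} X_1 \\ \vdots \\ X_N \end{bmatrix}\beta + \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_N \end{bmatrix},$$

or

$$y = \left[(I_N \otimes e) \;\; X\right]\begin{bmatrix} \alpha \\ \beta \end{bmatrix} + \varepsilon, \tag{21.39}$$

where $y$ is an $NT \times 1$ vector, the Kronecker product $(I_N \otimes e)$ is an $NT \times N$ block-diagonal
matrix, and $X$ is the $NT \times K$ matrix of nonconstant regressors.

An OLS estimation of this model yields the least-squares dummy variable
(LSDV) estimator

$$\begin{bmatrix} \hat\alpha_{LSDV} \\ \hat\beta_{LSDV} \end{bmatrix} = \begin{bmatrix} (I_N \otimes e)'(I_N \otimes e) & (I_N \otimes e)'X \\ X'(I_N \otimes e) & X'X \end{bmatrix}^{-1} \begin{bmatrix} (I_N \otimes e)'y \\ X'y \end{bmatrix} = \begin{bmatrix} TI_N & T\bar X \\ T\bar X' & X'X \end{bmatrix}^{-1} \begin{bmatrix} T\bar y \\ X'y \end{bmatrix},$$

where the matrix of sample means $\bar X = [\bar x_1 \cdots \bar x_N]'$, $\bar x_i = T^{-1}\sum_{t=1}^{T} x_{it}$,
$\bar y = [\bar y_1 \cdots \bar y_N]'$, and $\bar y_i = T^{-1}\sum_{t=1}^{T} y_{it}$. Using the formula for
partitioned inverse and performing further algebra leads to

$$\begin{bmatrix} \hat\alpha_{LSDV} \\ \hat\beta_{LSDV} \end{bmatrix} = \begin{bmatrix} \bar y - \bar X\hat\beta_{LSDV} \\ [X'X - T\bar X'\bar X]^{-1}[X'y - T\bar X'\bar y] \end{bmatrix}. \tag{21.40}$$

Reexpressing this in summation notation, we have $\hat\beta_{LSDV} = \hat\beta_W$ defined in (21.24) and
$\hat\alpha_{LSDV} = \hat\alpha_{FE}$ defined in (21.25), so the LSDV estimators equal the within or fixed
effects estimator.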

For short panels an obvious potential problem is that consistent estimation of $\beta$
and $\alpha$ is not guaranteed as there are $N + K$ parameters to estimate and $N \to \infty$.
Remarkably, consistent estimation of $\beta$ is possible, even though $\alpha$ is inconsistently
estimated, unless additionally $T \to \infty$.

This estimator is second-moment efficient if $\varepsilon_{it}$ are iid $[0, \sigma_\varepsilon^2]$. It follows that
the within estimator of $\beta$ is more efficient than alternative differencing estimators that
also eliminate $\alpha_i$, such as subtracting the first observation or the previous period’s
observation. If additionally the errors are normally distributed, the LSDV estimator
equals the MLE by the usual equivalence of OLS and MLE in the linear model with
spherical normal errors.


21.6.5. Covariance Estimator 


Suppose data belong to one of $N$ classes, with $y_{it}$ denoting the $t$th observation in the
$i$th class. The analysis of variance decomposes the total variation of $y_{it}$ around the
grand mean $\bar y$, $\sum_i\sum_t (y_{it} - \bar y)^2$, into within-group variation
$\sum_i\sum_t (y_{it} - \bar y_i)^2$ and between-group variation $\sum_i\sum_t (\bar y_i - \bar y)^2$, where
$\bar y_i$ is the mean in the $i$th group. Group membership becomes more important as between-group
variation increases. The analysis of covariance extends this approach to introduce regressors, in
which case the residual sum of squares is similarly decomposed. This framework is widely
used in applied statistics.

For short panels each individual is viewed as a class, observed for several time 
periods. The model (21.3) is called the analysis-of-covariance model, as it permits 
the mean residual in the ith class to differ over classes. The estimator of this model, 
the within estimator, is accordingly also called the covariance estimator. 
21.7. Random Effects Model


The random effects model (21.3) can be rewritten as

$$y_{it} = \mu + x_{it}'\beta + \alpha_i + \varepsilon_{it}, \quad i = 1, \ldots, N, \quad t = 1, \ldots, T, \tag{21.41}$$

or

$$y_{it} = w_{it}'\delta + \alpha_i + \varepsilon_{it}, \tag{21.42}$$

where $w_{it} = [1 \;\; x_{it}']'$ and $\delta = [\mu \;\; \beta']'$. The individual-specific effects $\alpha_i$ are
assumed to be realizations of iid random variables with distribution $[0, \sigma_\alpha^2]$ and the error
$\varepsilon_{it}$ is iid $[0, \sigma_\varepsilon^2]$. The nonrandom scalar intercept $\mu$ is added so that, unlike in
(21.5), the random effects can be normalized to have zero mean.

The model can alternatively be viewed as a special case of a random coefficient
or varying coefficient model, where only the intercept coefficient is random. The
model can be re-expressed as $y_{it} = \mu + x_{it}'\beta + u_{it}$, where the error term $u_{it}$ has two
components $u_{it} = \alpha_i + \varepsilon_{it}$. For this reason the random effects model is also called the
error components model. Even clearer terminology may be the random intercept
model. Richer mixed models also permit random slopes; see Chapter 22.

There are many consistent estimators of the random effects model, including (1)
GLS estimation in the model (21.42); (2) ML estimation in the model (21.42) assuming
$\alpha_i$ and $\varepsilon_{it}$ are normally distributed; (3) OLS estimation in the model (21.42); and
(4) fixed effects model estimators such as the within and first-differences estimators,
though these only estimate the coefficients of time-varying regressors. The first two
estimators are asymptotically equivalent but can vary in finite samples depending on
the specific estimates used for $\sigma_\alpha^2$ and $\sigma_\varepsilon^2$. The remaining estimators are consistent,
though they are inefficient if in fact $\alpha_i$ and $\varepsilon_{it}$ are iid.


21.7.1. GLS Estimator 


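The random effects estimator of $\mu$ and $\beta$ is the feasible GLS estimator of the model
(21.42), and it is shown later in this section that it can be implemented by OLS regression
of the transformed equation

$$y_{it} - \hat\lambda\bar y_i = (1 - \hat\lambda)\mu + (x_{it} - \hat\lambda\bar x_i)'\beta + v_{it}, \tag{21.43}$$

where $v_{it} = (1 - \hat\lambda)\alpha_i + (\varepsilon_{it} - \hat\lambda\bar\varepsilon_i)$ and $\hat\lambda$ is consistent for

$$\lambda = 1 - \sigma_\varepsilon/(T\sigma_\alpha^2 + \sigma_\varepsilon^2)^{1/2}. \tag{21.44}$$

Equivalently,

$$\hat\delta_{RE} = \begin{bmatrix} \hat\mu_{RE} \\ \hat\beta_{RE} \end{bmatrix} = \left[\sum_{i=1}^{N}\sum_{t=1}^{T}(w_{it} - \hat\lambda\bar w_i)(w_{it} - \hat\lambda\bar w_i)'\right]^{-1} \sum_{i=1}^{N}\sum_{t=1}^{T}(w_{it} - \hat\lambda\bar w_i)(y_{it} - \hat\lambda\bar y_i), \tag{21.45}$$

where $w_{it} = [1 \;\; x_{it}']'$ and $\bar w_i = [1 \;\; \bar x_i']'$. Consistency requires $NT \to \infty$, through
either $N \to \infty$ or $T \to \infty$ or both.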
Assuming that $\varepsilon_{it}$ and $\alpha_i$ are iid, the usual OLS output from OLS regression of
(21.43) can be used to obtain the variance matrix estimate, so that

$$\hat V[\hat\delta_{RE}] = \hat\sigma_\varepsilon^2 \left[ \sum_{i=1}^{N} \sum_{t=1}^{T} (w_{it} - \hat\lambda \bar w_i)(w_{it} - \hat\lambda \bar w_i)' \right]^{-1}. \tag{21.46}$$

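Alternatively, for short panels a robust variance estimate that permits quite general
behavior for $\alpha_i + \varepsilon_{it}$ can be obtained using (21.13). This yields

$$\hat V[\hat\delta_{RE}] = \left[ \sum_{i=1}^{N} \sum_{t=1}^{T} \tilde w_{it} \tilde w_{it}' \right]^{-1} \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{s=1}^{T} \tilde w_{it} \tilde w_{is}' \hat{\tilde\varepsilon}_{it} \hat{\tilde\varepsilon}_{is} \left[ \sum_{i=1}^{N} \sum_{t=1}^{T} \tilde w_{it} \tilde w_{it}' \right]^{-1}, \tag{21.47}$$

where $\tilde w_{it} = w_{it} - \hat\lambda \bar w_i$ and $\hat{\tilde\varepsilon}_{it} = \hat\varepsilon_{it} - \hat\lambda \bar{\hat\varepsilon}_i$, where $\hat\varepsilon_{it}$ is the RE residual. This estimate
permits arbitrary autocorrelations for the $\varepsilon_{it}$ and arbitrary heteroskedasticity.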
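Equation (21.46) requires consistent estimates of the variance components $\sigma_\varepsilon^2$ and
$\sigma_\alpha^2$. From the within or fixed effects regression of $(y_{it} - \bar y_i)$ on $(x_{it} - \bar x_i)$ we obtain

$$\hat\sigma_\varepsilon^2 = \frac{1}{N(T-1) - K} \sum_{i=1}^{N} \sum_{t=1}^{T} \left( (y_{it} - \bar y_i) - (x_{it} - \bar x_i)'\hat\beta_W \right)^2. \tag{21.48}$$

From the between regression of $\bar y_i$ on an intercept and $\bar x_i$, an equation that has error
term with variance $\sigma_\alpha^2 + \sigma_\varepsilon^2/T$, we obtain

$$\hat\sigma_\alpha^2 = \frac{1}{N - (K+1)} \sum_{i=1}^{N} \left( \bar y_i - \hat\mu_B - \bar x_i'\hat\beta_B \right)^2 - \frac{1}{T}\hat\sigma_\varepsilon^2. \tag{21.49}$$

More efficient estimators of the variance components $\sigma_\varepsilon^2$ and $\sigma_\alpha^2$ are possible (see, for
example, Amemiya, 1985), but these will not necessarily increase the efficiency of
$\hat\beta_{RE}$. A wide range of estimators are possible. The variance estimator (21.49) can be
negative, in which case programs often set $\hat\sigma_\alpha^2 = 0$, so $\hat\lambda = 0$ and estimation is then by
pooled OLS.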
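
The feasible GLS steps in (21.43)-(21.49) can be sketched numerically as follows. This is a minimal illustration for a balanced panel, not a reference implementation: the function name `re_fgls`, the array layout (an N x T outcome array and an N x T x K regressor array), and the simulated data are assumptions made here for the example.

```python
import numpy as np

def re_fgls(y, X):
    """Feasible GLS for the random effects model on a balanced panel.

    y: (N, T) array; X: (N, T, K) array of regressors without an intercept.
    Returns (delta_hat, lam_hat, sig2_eps, sig2_alpha), delta_hat = [mu, beta].
    """
    N, T, K = X.shape
    ybar, Xbar = y.mean(axis=1), X.mean(axis=1)      # individual means

    # Within regression gives sigma_eps^2 as in (21.48).
    yw = (y - ybar[:, None]).reshape(-1)
    Xw = (X - Xbar[:, None, :]).reshape(-1, K)
    bw = np.linalg.lstsq(Xw, yw, rcond=None)[0]
    sig2_eps = np.sum((yw - Xw @ bw) ** 2) / (N * (T - 1) - K)

    # Between regression gives sigma_alpha^2 as in (21.49).
    Wb = np.column_stack([np.ones(N), Xbar])
    db = np.linalg.lstsq(Wb, ybar, rcond=None)[0]
    sig2_alpha = np.sum((ybar - Wb @ db) ** 2) / (N - (K + 1)) - sig2_eps / T
    sig2_alpha = max(sig2_alpha, 0.0)                # truncate a negative estimate at zero

    # lambda-hat from (21.44), then OLS on the transformed equation (21.43).
    lam = 1.0 - np.sqrt(sig2_eps / (T * sig2_alpha + sig2_eps))
    ys = (y - lam * ybar[:, None]).reshape(-1)
    Xs = (X - lam * Xbar[:, None, :]).reshape(-1, K)
    Ws = np.column_stack([np.full(N * T, 1.0 - lam), Xs])
    delta = np.linalg.lstsq(Ws, ys, rcond=None)[0]
    return delta, lam, sig2_eps, sig2_alpha

# Simulated example (illustrative values): mu = 1, beta = (0.5, -0.3),
# sigma_alpha^2 = sigma_eps^2 = 1, so lambda = 1 - (1/6)^{1/2}.
rng = np.random.default_rng(0)
N, T = 500, 5
X = rng.normal(size=(N, T, 2))
alpha = rng.normal(size=N)
y = 1.0 + X @ np.array([0.5, -0.3]) + alpha[:, None] + rng.normal(size=(N, T))
delta, lam, s2e, s2a = re_fgls(y, X)
```

When the truncation sets the estimate of $\sigma_\alpha^2$ to zero, `lam` is zero and the last regression is pooled OLS, as noted in the text.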

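To verify that the feasible GLS estimator simplifies to OLS estimation of (21.43),
stack (21.42) by observations from all T time periods for given i in the same way as
for the fixed effects model. Then

$$y_i = W_i \delta + (e\alpha_i + \varepsilon_i), \tag{21.50}$$

where $y_i$, $e$, $\varepsilon_i$, and $X_i$ are defined after (21.29), and $W_i = [e \;\; X_i]$. To estimate by
GLS we need to obtain the variance matrix $\Omega$ of the $T \times 1$ vector error $(e\alpha_i + \varepsilon_i)$.
Given independence of $\alpha_i$ and $\varepsilon_i$ we have $E[(e\alpha_i + \varepsilon_i)(e\alpha_i + \varepsilon_i)'] = E[\varepsilon_i \varepsilon_i'] +
E[\alpha_i^2] ee'$. Since $\varepsilon_{it}$ are iid $[0, \sigma_\varepsilon^2]$ and $\alpha_i$ are iid $[0, \sigma_\alpha^2]$ we obtain

$$\Omega = \sigma_\varepsilon^2 I_T + \sigma_\alpha^2 ee' = \sigma_\varepsilon^2 \left[ Q + \frac{1}{\psi^2}(I_T - Q) \right],$$

where $Q = I_T - T^{-1}ee'$ was introduced in (21.30) and $\psi^2 = \sigma_\varepsilon^2 / [\sigma_\varepsilon^2 + T\sigma_\alpha^2]$. Using
$QQ' = Q$ we can easily verify that $\Omega^{-1} = \sigma_\varepsilon^{-2}[Q + \psi^2 (I_T - Q)]$ and

$$\Omega^{-1/2} = \frac{1}{\sigma_\varepsilon} \left[ Q + \psi (I_T - Q) \right]. \tag{21.51}$$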
The GLS estimator is obtained by premultiplication of (21.50) by any scalar multiple
of $\Omega^{-1/2}$. Now

$$[Q + \psi(I_T - Q)]\, y_i = y_i - e\bar y_i + \psi\,(y_i - (y_i - e\bar y_i)) = y_i - \lambda e \bar y_i,$$

where $\lambda = (1 - \psi)$. Performing similar algebra for $W_i$, $e\alpha_i$, and $\varepsilon_i$ in (21.50) yields
the following model:

$$y_i - \lambda e \bar y_i = (W_i - \lambda e \bar w_i')\delta + (1 - \lambda) e \alpha_i + (\varepsilon_i - \lambda e \bar\varepsilon_i), \tag{21.52}$$

where the transformed error in (21.52) has variance matrix $\sigma_\varepsilon^2 I_T$. The GLS estimator
is the OLS estimator of (21.52), but (21.52) is just a stacked version of (21.43) with
the scalar $\lambda$ replaced by a consistent estimate.
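
The matrix algebra in (21.50)-(21.52) is easy to check numerically. The sketch below uses arbitrary illustrative values for $T$, $\sigma_\varepsilon^2$, and $\sigma_\alpha^2$ (assumptions of this example, not values from the text) and confirms that the stated $\Omega^{-1/2}$ inverts $\Omega$ and that premultiplying by $\sigma_\varepsilon \Omega^{-1/2}$ amounts to subtracting $\lambda$ times the individual mean.

```python
import numpy as np

# Illustrative values only.
T, sig2_eps, sig2_alpha = 4, 1.5, 2.0
e = np.ones((T, 1))
Q = np.eye(T) - e @ e.T / T                       # within (demeaning) matrix
psi = np.sqrt(sig2_eps / (sig2_eps + T * sig2_alpha))
lam = 1.0 - psi

Omega = sig2_eps * np.eye(T) + sig2_alpha * (e @ e.T)
Omega_inv_half = (Q + psi * (np.eye(T) - Q)) / np.sqrt(sig2_eps)   # (21.51)

# Omega^{-1/2} applied twice should give Omega^{-1}.
check_inverse = np.allclose(Omega_inv_half @ Omega_inv_half @ Omega, np.eye(T))

# sigma_eps * Omega^{-1/2} y_i equals y_i - lambda * e * ybar_i.
y_i = np.array([1.0, 3.0, 2.0, 6.0])
check_demean = np.allclose(np.sqrt(sig2_eps) * Omega_inv_half @ y_i,
                           y_i - lam * y_i.mean())

# The transformed error has a scalar variance matrix.
check_var = np.allclose(Omega_inv_half @ Omega @ Omega_inv_half.T, np.eye(T))
```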

The random effects estimator $\hat\beta_{RE}$ of the slope parameters converges to the within
estimator as $T \to \infty$ since then $\lambda \to 1$. Otherwise, $\hat\beta_{RE}$ can be shown, after some
algebra, to equal a matrix-weighted combination of the within estimator and the
between estimator. If the random effects model is appropriate, this weighted average
works better than using the within estimator alone. However, if the fixed effects model
is appropriate then this weighted average is inconsistent, as the between estimator is
then inconsistent. The estimator of the intercept can be shown to simplify to $\hat\mu_{RE} =
\bar y - \bar x'\hat\beta_{RE}$. For more details see, for example, Hsiao (2003, p. 36) or Greene (2003).


21.7.2. ML Estimator 


In the derivation in the previous section, normality of the errors is not assumed. If they
are in fact normal, we can maximize the log-likelihood function with respect to $\beta$, $\mu$,
$\sigma_\varepsilon^2$, and $\sigma_\alpha^2$. For given $\sigma_\varepsilon^2$ and $\sigma_\alpha^2$ the MLE for $\beta$ and $\mu$ is the same as the GLS estimator,
but the MLE gives estimators $\tilde\sigma_\varepsilon^2$ and $\tilde\sigma_\alpha^2$ that differ from those given in (21.48) and
(21.49).

Thus the MLE for $\beta$ and $\mu$ is given by (21.45) with $\hat\lambda$ replaced by the alternative
consistent estimate $\tilde\lambda = 1 - \tilde\sigma_\varepsilon / (T\tilde\sigma_\alpha^2 + \tilde\sigma_\varepsilon^2)^{1/2}$. Asymptotically, the MLE and GLS
estimators of the random effects model are equivalent, but the two will differ in finite
samples.

For the MLE the likelihood may have two local maxima rather than one for
$0 < \psi^2 < 1$, so care is needed to ensure a global maximum.


21.7.3. Other Estimators 


Many different estimators of $\beta$ are consistent if the random effects model is the correct
model. In particular, the pooled OLS, within, first-differences, and between estimators
are all consistent. However, they are inefficient if $\alpha_i$ and $\varepsilon_{it}$ are iid, and
the within and first-differences estimators can only estimate the coefficients of
time-varying regressors.
21.8. Modeling Issues


In this section we consider some practical issues that arise in linear panel data mod- 
els, even in the absence of complications such as endogeneity and lagged dependent 
variables, topics that are deferred to Chapter 22. 


21.8.1. Tests for Pooling 


The random effects model restricts all regression parameters to be the same in different
cross sections and time periods, whereas the fixed effects model imposes parameter
constancy except for the intercept, which may vary across individuals. Tests of
poolability test the appropriateness of these constraints.

These tests are usually done using a Chow test (see Greene, 2003, p. 130), which
tests for equality of the regression coefficients in two linear regressions assuming a common
variance. Depending on the assumptions about errors, the Chow test may be applied
to models estimated by OLS or by GLS. Baltagi (2001, Chapter 4) and Hsiao (2003, 
Chapter 2) provide detailed coverage. 

For short panels it is not possible to allow the slope parameters to differ across
individuals, as then the number of parameters goes to infinity. However, parameters can
be permitted to vary over time. The model $y_{it} = \gamma + x_{it}'\beta + u_{it}$ is then tested against
$y_{it} = \gamma_t + x_{it}'\beta_t + u_{it}$. The most obvious method is to assume random effects with
$u_{it} = \varepsilon_{it} + \alpha_i$, estimate the restricted model ($\gamma_t = \gamma$ and $\beta_t = \beta$) using the random
effects GLS estimator, and compare the restricted and unrestricted residual sums of
squares in the transformed models. If more robust inference is preferred then
panel-robust standard errors should be obtained and a Wald test performed. For short panels
it is common to specify models with slope parameters $\beta$ constant, though the intercept
$\gamma_t$ may be permitted to vary over time by inclusion of time dummies as additional
regressors.


21.8.2. Tests for Individual-Specific Effects 


Breusch and Pagan (1980) derived Lagrange-multiplier tests for the presence of 
individual-specific random effects against the null hypothesis assumption of iid er- 
rors. These have the advantage of being easily implemented by an auxiliary regression 
that requires only residuals from pooled OLS estimates. Alternatively, one can assume 
normality and do a likelihood ratio test of the random effects MLE against the MLE of 
the constant-coefficients model, or a Wald test of $\sigma_\alpha^2 = 0$ in the random effects model.
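
As an illustration of how little is needed for the Lagrange multiplier test, the sketch below computes the statistic from pooled OLS residuals for a balanced panel. The closed form used, $LM = \frac{NT}{2(T-1)}\left[\sum_i (\sum_t \hat u_{it})^2 / \sum_i \sum_t \hat u_{it}^2 - 1\right]^2$, is the standard textbook expression for the Breusch-Pagan statistic and is not stated in this section; the function name and the simulated data are assumptions of this example. Under the null of iid errors the statistic is asymptotically $\chi^2(1)$.

```python
import numpy as np

def breusch_pagan_lm(resid):
    """LM statistic for H0: sigma_alpha^2 = 0.

    resid: (N, T) array of pooled OLS residuals from a balanced panel.
    """
    N, T = resid.shape
    ratio = np.sum(resid.sum(axis=1) ** 2) / np.sum(resid ** 2)
    return N * T / (2.0 * (T - 1)) * (ratio - 1.0) ** 2

# Simulated data with a sizeable individual effect (illustrative values).
rng = np.random.default_rng(2)
N, T = 200, 5
x = rng.normal(size=(N, T))
alpha = rng.normal(scale=2.0, size=N)
y = 1.0 + 0.5 * x + alpha[:, None] + rng.normal(size=(N, T))

# Pooled OLS residuals, reshaped back to (N, T).
W = np.column_stack([np.ones(N * T), x.reshape(-1)])
coef = np.linalg.lstsq(W, y.reshape(-1), rcond=None)[0]
resid = (y.reshape(-1) - W @ coef).reshape(N, T)

lm = breusch_pagan_lm(resid)   # compare with the chi-squared(1) 5% critical value, 3.84
```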

In practice one often rejects the null hypothesis that the errors in the constant- 
coefficients model are iid. It is easiest to immediately estimate by pooled OLS with 
panel-robust standard errors or by random effects GLS. 

For a short panel formal tests for the presence of individual-specific fixed effects 
are not possible because of the incidental parameters problem. It is not possible to 
test whether N parameters are zero when there are only NT observations and T is 
small. Instead, the Hausman test of Section 21.4.3 is used to test the null hypothesis of 
random effects against the alternative of fixed effects. 
21.8.3. Prediction


Prediction in models without individual effects is straightforward: Use $\hat y_{js} = x_{js}'\hat\beta$.
This is a prediction of the population average $E[y_{js}|x_{js}]$.

Prediction for a given individual conditional on the individual-specific effect is more
difficult. This is prediction of $E[y_{js}|x_{js}, \alpha_j]$. We consider out-of-sample forecasts for
the ith individual using the random effects model (21.42). Then $y_{i,T+s} = w_{i,T+s}'\delta + u_{i,T+s}$,
where $u_{i,T+s} = \alpha_i + \varepsilon_{i,T+s}$. The obvious predictor replaces $\delta$ by $\hat\delta_{RE}$ and $u_{i,T+s}$ by
either 0 or $\bar u_i$, where $\bar u_i = \bar y_i - \bar w_i'\hat\delta_{RE}$ is the average within-sample residual for the ith
individual. However, this is inefficient as it ignores the correlation between $u_{i,T+s}$ and
in-sample errors induced by the individual-specific random effect $\alpha_i$. The problem is
an example of the more general problem of prediction within a GLS rather than an OLS
framework. For this special case the best linear unbiased predictor (see Section 22.8.3)
is $\hat y_{i,T+s} = w_{i,T+s}'\hat\delta_{RE} + (T\sigma_\alpha^2 / (T\sigma_\alpha^2 + \sigma_\varepsilon^2))\,\bar u_i$. For the fixed effects model the obvious
predictor is $\hat y_{i,T+s} = x_{i,T+s}'\hat\beta_W + \hat\alpha_{i,FE}$, but again this is inconsistent in short panels.
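
The best linear unbiased predictor for the random effects model shrinks the average in-sample residual toward zero by the factor $T\sigma_\alpha^2/(T\sigma_\alpha^2 + \sigma_\varepsilon^2)$. A minimal sketch follows; the function name `re_blup` and the numerical inputs are hypothetical, and in practice the variance components would be replaced by estimates.

```python
import numpy as np

def re_blup(w_future, delta_hat, ubar_i, T, sig2_alpha, sig2_eps):
    """Random effects predictor of y_{i,T+s}.

    w_future: regressor vector for period T+s, including the leading 1.
    ubar_i:   average within-sample RE residual for individual i.
    """
    shrink = T * sig2_alpha / (T * sig2_alpha + sig2_eps)
    return float(np.dot(w_future, delta_hat) + shrink * ubar_i)

# Hypothetical inputs for illustration.
w = np.array([1.0, 2.0])
d = np.array([0.5, 1.5])

# No individual effect: the predictor ignores the residual entirely.
pred_no_effect = re_blup(w, d, ubar_i=0.8, T=4, sig2_alpha=0.0, sig2_eps=1.0)
# Equal variance components with T = 4: 80% of the average residual is added.
pred_shrunk = re_blup(w, d, ubar_i=0.8, T=4, sig2_alpha=1.0, sig2_eps=1.0)
```

The two extreme choices mentioned in the text, replacing $u_{i,T+s}$ by 0 or by $\bar u_i$, correspond to a shrinkage factor of zero or one.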


21.8.4. Two-Way Effects Models 


The analysis to date has focused on the one-way model, which is (21.1) with $u_{it} =
\alpha_i + \varepsilon_{it}$. A more general model is the two-way effects model, with $u_{it} = \alpha_i + \gamma_t +
\varepsilon_{it}$, which additionally allows for time-specific effects. Then

$$y_{it} = \alpha_i + \gamma_t + x_{it}'\beta + \varepsilon_{it}, \quad i = 1,\ldots,N, \quad t = 1,\ldots,T. \tag{21.53}$$

This model was presented originally in (21.2).

As already noted, for short panels the usual approach is to treat the time-specific 
effects as fixed and estimate them as the coefficients of time dummies that are included 
in the regressors, with analysis then differing according to whether the individual- 
specific effects are treated as fixed or random. 

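If both $\alpha_i$ and $\gamma_t$ are fixed then the OLS estimator of $\beta$ in (21.53) is equivalent to
regression of $y_{it} - \bar y_i - \bar y_t + \bar y$ on $x_{it} - \bar x_i - \bar x_t + \bar x$, where $\bar y_i = T^{-1} \sum_{t=1}^{T} y_{it}$, $\bar y_t =
N^{-1} \sum_{i=1}^{N} y_{it}$, and $\bar y = (NT)^{-1} \sum_{i=1}^{N} \sum_{t=1}^{T} y_{it}$, with similar definitions for $\bar x_i$, $\bar x_t$, and
$\bar x$. This method of estimation is convenient if T is large.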
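
The equivalence between the two-way within transformation and least squares with a full set of individual and time dummies can be checked on simulated data. The sketch below assumes a balanced panel; the helper name `two_way_demean` and all simulated values are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, K = 30, 6, 2
X = rng.normal(size=(N, T, K))
alpha = rng.normal(size=N)              # individual effects
gamma = rng.normal(size=T)              # time effects
beta = np.array([1.0, -2.0])
y = alpha[:, None] + gamma[None, :] + X @ beta + rng.normal(size=(N, T))

def two_way_demean(a):
    """Return a_it - abar_i - abar_t + abar for an array with leading axes (N, T)."""
    return (a - a.mean(axis=1, keepdims=True)
              - a.mean(axis=0, keepdims=True)
              + a.mean(axis=(0, 1), keepdims=True))

# OLS on the doubly demeaned data.
b_within = np.linalg.lstsq(two_way_demean(X).reshape(-1, K),
                           two_way_demean(y).reshape(-1), rcond=None)[0]

# Dummy-variable regression: N individual dummies plus T-1 time dummies.
D_i = np.kron(np.eye(N), np.ones((T, 1)))
D_t = np.kron(np.ones((N, 1)), np.eye(T))[:, 1:]
Z = np.column_stack([X.reshape(-1, K), D_i, D_t])
b_dummy = np.linalg.lstsq(Z, y.reshape(-1), rcond=None)[0][:K]
```

The demeaned regression involves only K regressors, whereas the dummy-variable regression has N + T - 1 + K columns, which is why the transformation is the more convenient route when the number of dummies is large.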

If instead both $\alpha_i$ and $\gamma_t$ are random then the error term will have a component $\gamma_t$
that induces error correlation across individuals, whereas we have focused on
independence over i. It can be shown that the GLS estimator can be computed by OLS
estimation of $y_{it}^*$ on a constant and $x_{it}^*$, where

$$y_{it}^* = y_{it} - \lambda_1 \bar y_i - \lambda_2 \bar y_t + \lambda_3 \bar y,$$

where $\bar y_i$, $\bar y_t$, and $\bar y$ have already been defined and $x_{it}^*$ is defined analogously to $y_{it}^*$.


For this and other results for the two-way effects model see Hsiao (2003) or Baltagi 
(2001). 
21.8.5. Unbalanced Panel Data


The discussion thus far has assumed the panel is balanced, meaning that data are 
available for every individual in every year. For panel data on different regions this 
is often the case. In contrast, for panel surveys of individuals there is usually a drop 
off or attrition over time in the proportion of individuals still answering the survey. 
Moreover, some individuals may miss one or more periods but return later, in some 
cases by design as in the case of rotating panels such as the CPS, where households 
are surveyed for four consecutive months, not surveyed for eight months, and then 
surveyed for another four months. Such panels where different individuals appear in 
different years are called unbalanced panels or incomplete panels. 

Let $d_{it}$ be an indicator variable equal to one if the itth observation is observed and
equal to zero otherwise. Then for the individual-specific effects model (21.3) the FE
estimator is consistent if the strong exogeneity assumption (21.4) becomes

$$E[u_{it} | \alpha_i, x_{i1}, \ldots, x_{iT}, d_{i1}, \ldots, d_{iT}] = 0, \tag{21.54}$$

and the RE estimator is consistent if additionally $\alpha_i$ is independent of the other
conditioning variables. The fixed and random effects estimators can then be applied to
unbalanced data with relatively little adjustment. This should be clear from the initial
presentation of the estimators as OLS estimators in various models given in
Section 21.2.2. For example, for the random effects model replace $\lambda$ in (21.10) by
$\lambda_i = 1 - \sigma_\varepsilon / (T_i \sigma_\alpha^2 + \sigma_\varepsilon^2)^{1/2}$, where $T_i$ is the number of observations for individual i
(see Baltagi, 1985, and Wansbeek and Kapteyn, 1989). Davis (2002) considers multi- 
way random effects models. For the fixed effects model an individual observation must 
be observed at least twice in the sample and degrees of freedom must be appropriately 
adjusted. Baltagi (2001) gives a lengthy treatment of unbalanced panels. Economet- 
rics packages that estimate the more standard of the panel models presented in Chap- 
ters 21-23 usually automatically handle missing observations. 
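
For the random effects model the adjustment for unbalanced data amounts to using an individual-specific $\lambda_i$ that depends on $T_i$. The sketch below applies this to data in long format; the function name, the argument layout, and the tiny data set are assumptions made for illustration, and the variance components are taken as given rather than estimated.

```python
import numpy as np

def quasi_demean_unbalanced(ids, v, sig2_alpha, sig2_eps):
    """Subtract lambda_i times the individual mean, with lambda_i based on T_i.

    ids: individual identifier for each observation (long format).
    v:   the variable to transform (outcome or a single regressor).
    Returns the transformed variable and the vector of lambda_i, one per individual.
    """
    ids = np.asarray(ids)
    v = np.asarray(v, dtype=float)
    _, inverse, counts = np.unique(ids, return_inverse=True, return_counts=True)
    means = np.bincount(inverse, weights=v) / counts            # individual means
    lam = 1.0 - np.sqrt(sig2_eps / (counts * sig2_alpha + sig2_eps))
    return v - lam[inverse] * means[inverse], lam

# Individual 1 is observed three times, individual 2 only once.
v_star, lam_i = quasi_demean_unbalanced([1, 1, 1, 2], [1.0, 2.0, 3.0, 10.0],
                                        sig2_alpha=1.0, sig2_eps=1.0)
```

An individual observed only once still contributes to the random effects estimator, with a smaller $\lambda_i$, whereas the within estimator requires at least two observations per individual.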

At times it may be convenient to convert an unbalanced panel into a balanced panel, 
by including in the sample only those individuals with data available in all years. This 
obviously can greatly reduce efficiency because of the loss of many observations. Fur- 
thermore, if data are not randomly missing this can exacerbate potential problems of a 
nonrepresentative sample. 

One reason for missing data can be that although most variables are observed, at 
least one variable is not. For example, the nonresponse rate to income questions can 
be quite high. Rather than drop an entire observation because data for one regressor, 
income, is missing there may be efficiency gains to using the imputation methods 
presented in Chapter 27. 

Unbalanced panels require special methods if the reason for individuals dropping 
out of the sample is correlated with the error term, so that (21.54) does not hold. For 
example, those individuals with unusually low wages (after controlling for observed 
characteristics) may be more likely to drop out of a panel sample. The result is an 
unrepresentative panel that will lead to attrition bias if wage is the dependent variable. 
Consistent estimation requires use of sample selection methods extended to panel data 
(see Section 23.5.2). 
21.8.6. Measurement Error


Measurement error in regressors leads to inconsistent parameter estimates in cross- 
section regression models. If panel data methods are used that involve differencing of 
the data, the result may be a large increase in the inconsistency caused by measurement 
error depending on the assumptions made about the dgp. This is pursued in Chapter 26. 


21.9. Practical Considerations 


The various estimators presented in this chapter are easily implemented. The most 
foolproof method is to use the panel commands available in econometric packages 
such as LIMDEP, STATA, and TSP, all of which have the added advantage of usually 
handling unbalanced panels. Most estimators can alternatively be estimated using an 
appropriate pooled OLS regression on transformed data that requires only a cross- 
section package, though standard errors may then differ from panel package standard 
errors because the latter may ignore autocorrelation induced by transformation and 
may use different degrees of freedom. 

A weakness of panel commands in packages is that they currently compute standard 
errors based on restrictive distributional assumptions such as iid errors in the fixed 
effects models, and iid individual effect and iid errors in the random effects model. To 
compute the more robust standard error estimates presented in this chapter may require 
panel estimation with a panel bootstrap or estimation of an appropriate pooled OLS 
regression using an option to compute cluster-robust standard errors. 

In microeconometric analysis there is a fundamental distinction between models 
with and models without fixed effects. If a model without fixed effects is preferred 
it should be justified by passing a Hausman test. If this test rejects the random ef- 
fects model then it may still be possible to consistently estimate coefficients of time- 
invariant regressors using the instrumental variables methods presented in the next 
chapter. 


21.10. Bibliographic Notes 


Most textbooks, such as Greene’s (2003), include at least a chapter on panel data models. 
Wooldridge (2002) has several chapters that cover both linear and nonlinear panel models. 
Econometrics monographs on panel data include those by Hsiao (1986, 2003), Baltagi (1995, 
2001), Matyas and Sevestre (1995), M.-J. Lee (2002), and Arellano (2003). The last three books
place greater emphasis on the methods presented in Chapters 22 and 23. Diggle, Liang, and
Zeger (1994, 2002) is a standard statistics reference. 


21.4 Mundlak (1978) wrote a classic article on fixed versus random effects models. Hausman 
(1978) used tests between these two models to illustrate his testing approach. 

21.6 Kuh (1959) and Hoch (1962) provide two early panel data applications to estimation of 
investment functions and of Cobb-Douglas production functions. These studies contrast 
use of within estimates using time-series variation and between estimates using cross- 
section variation. 


Exercises 


(Adapted from Baltagi, 1999) Consider the panel model $y_{it} = \alpha + \beta x_{it} + u_{it}$,
where $\alpha$ and $\beta$ are scalars.

(a) Show by appropriate subtraction that this model implies

$$y_{it} - \bar y = \beta (x_{it} - \bar x_i) + \beta (\bar x_i - \bar x) + (u_{it} - \bar u),$$

where $\bar y = (NT)^{-1} \sum_i \sum_t y_{it}$, $\bar x = (NT)^{-1} \sum_i \sum_t x_{it}$, and $\bar x_i = T^{-1} \sum_t x_{it}$.

(b) For the corresponding unrestricted least-squares regression

$$y_{it} - \bar y = \beta_1 (x_{it} - \bar x_i) + \beta_2 (\bar x_i - \bar x) + (u_{it} - \bar u),$$

show that the least-squares estimator of $\beta_1$ is the within estimator and that
of $\beta_2$ is the between estimator.

(c) Show that if $u_{it} = \mu_i + \nu_{it}$, where $\mu_i \sim$ iid$[0, \sigma_\mu^2]$ and $\nu_{it} \sim$ iid$[0, \sigma_\nu^2]$, and the
two are mutually independent across both i and t, the OLS and the GLS
estimators are equivalent.


Consider estimation of the fixed effects linear regression model $y_{it} = \alpha_i + x_{it}'\beta +
\varepsilon_{it}$, where $\alpha_i$ are fixed effects possibly correlated with $x_{it}$. Stacking all T observations
for individual i yields $y_i = \alpha_i e + X_i \beta + \varepsilon_i$ (see (21.29) for definitions). Consider
the estimator $\hat\beta = [\sum_{i=1}^{N} X_i' J' J X_i]^{-1} \times \sum_{i=1}^{N} X_i' J' J y_i$, where $J$ is a $T \times T$
matrix of known constants such that $Je = 0$. [Note that an example of $J$ is
$Q = I_T - T^{-1}ee'$.]

(a) Provide a motivation for the estimator $\hat\beta$.

(b) Find $E[\hat\beta]$. For simplicity assume that $X_i$ are fixed regressors and that $\varepsilon_{it}$ are
iid $[0, \sigma^2]$. Is $\hat\beta$ unbiased for $\beta$?

(c) Find $V[\hat\beta]$. For simplicity assume that $X_i$ are fixed regressors and that $\varepsilon_{it}$ are
iid $[0, \sigma^2]$.

(d) Now suppose $\varepsilon_{it}$ are independent over i but correlated over t with $V[\varepsilon_i] = \Omega_i$.
Give $V[\hat\beta]$.

(e) Suppose that the effects $\alpha_i$ are random $(0, \sigma_\alpha^2)$ rather than fixed. Would the
estimator in this exercise be consistent?


(Adapted from Baltagi, 1998) Consider the fixed effects, two-way error component
panel data model

$$y_{it} = \alpha + x_{it}'\beta + \mu_i + \lambda_t + \varepsilon_{it},$$

where $\alpha$ is a scalar, $x_{it}$ is a $K \times 1$ vector of exogenous regressors, $\beta$ is a $K \times 1$
vector, $\mu_i$ and $\lambda_t$ denote fixed individual and time effects, respectively, and $\varepsilon_{it} \sim$
iid$[0, \sigma^2]$.

(a) Show that the within estimator of B, which is best linear unbiased, can be 
obtained by applying two within (one-way) transformations on this model. 
The first is the within transformation ignoring the time effects followed by the 
within transformation ignoring the individual effects. 

(b) Show that the order of these two within (one-way) transformations is unim- 
portant. Give an intuitive explanation for this result. 


Use a 50% random subsample of the wage-hours data in Section 21.3.

(a) Can $\beta$ be directly interpreted as a labor supply elasticity? Explain.

(b) For the following estimators: (1) pooled OLS, (2) between, (3) within, (4) first
differences, (5) random effects GLS, (6) random effects MLE give (i) $\hat\beta$ (estimated
coefficient of lnwg), (ii) default standard error, and (iii) panel bootstrap
standard error with 200 replications.

(c) Are the estimates of $\beta$ similar?

(d) Is there a systematic difference between default standard errors and
panel-robust standard errors?

(e) Will the pooled OLS estimator in part (b) be consistent for $\beta$ in a fixed effects
model? Will the pooled OLS estimator be consistent for $\beta$ in a random effects
model?

(f) Perform a Hausman test of the difference between the fixed and random
effects (GLS) estimates of $\beta$ in this model. Do this manually using the earlier
regression output with the default standard errors. What do you conclude
and which model is favored?

(g) Given the preceding evidence, do you believe that the labor supply curve is 
upward sloping? Explain. 
Linear Panel Models: Extensions


22.1. Introduction 


The previous chapter presented variants of the linear panel data model with a fixed 
or random intercept and regressors that are strongly exogenous. Now we move on to 
various extensions for linear models, with focus on relaxation of the strong exogene- 
ity assumption to permit consistent estimation of models with endogenous variables 
and/or lagged dependent variables as regressors. 

The use of instrumental variables is a standard method to handle endogenous re- 
gressors. It is much easier to obtain instruments with panel data than with cross-section 
data, since exogenous regressors in other time periods can be used as instruments for 
endogenous regressors in the current time period. The only complication is to first 
control for any fixed or random effects. 

Panel data permit regressors to additionally include lagged dependent variables, 
data unavailable with a single cross section. This permits estimation of dynamic mod- 
els that distinguish between persistence of earnings, for example, as the result of vari- 
ation around an unobserved individual-specific effect, as in Chapter 21, and persis- 
tence caused by the outcomes of previous periods directly determining the outcome 
of the current period. The estimators of Chapter 21 that control for individual-specific 
effects become inconsistent, however, if lagged dependent variables are regressors. In- 
strumental variables estimation using longer lags as instruments leads to consistent 
estimation. 

Panel data provide an excess of moment conditions available for estimation, owing 
to an abundance of instruments, and panel model errors are usually not iid. The nat- 
ural estimation framework is that of panel GMM, presented in detail in Section 22.2 
and illustrated with an application to estimation of the labor supply elasticity in Sec- 
tion 22.3. Further details on estimation with individual-specific effects and regressors 
that are endogenous or lagged dependent variables are presented in Sections 22.4 and 
22.5. The discussion is quite extensive due to the many possible variations that are 
covered. These include the presence of individual specific effects that may be fixed or 
random, different exogeneity assumptions, and models that may be just-identified or
over-identified. 

The remainder of this chapter considers other stand-alone topics that generally do 
not require reading of Sections 22.2—22.5. Models closely related to panel data models 
are presented in Sections 22.6—22.8, namely repeated cross-section data, differences 
in differences, and hierarchical models. 


22.2. GMM Estimation of Linear Panel Models 


The panel regression models in Chapter 21 restricted the scalar dependent variable $y_{it}$
to depend on just the contemporaneous value of regressors $x_{it}$, even though potentially
all of $x_{i1}, \ldots, x_{iT}$ could be regressors under the Chapter 21 assumption of strong
exogeneity. This introduces the possibility of more efficient estimation using excluded
regressors from other periods as instruments in the current period.

Furthermore, regressors in other periods may be valid instruments for current- 
period regressors that are either endogenous or lags of the dependent variable. So in- 
struments are readily available to permit consistent IV estimation in situations where 
failure of the strong exogeneity assumption leads to inconsistency of the Chapter 21 
estimators. 

This section provides a general presentation of panel GMM estimation, a very use- 
ful framework for panel IV estimation that is used extensively throughout Sections 
22.2—22.5. Then we introduce the use of exogenous variables (regressors or instru- 
ments) in periods other than the current period as an instrument. Once this ground- 
work is laid it is a relatively minor adaptation to incorporate fixed or random effects, 
typically included in panel models. This is deferred to subsequent sections. 


22.2.1. Panel GMM 
Consider the linear panel model

$$y_{it} = x_{it}'\beta + u_{it}, \tag{22.1}$$

where the regressors $x_{it}$ may have both time-varying and time-invariant components
and may include an intercept. Here there is no individual-specific effect $\alpha_i$, an
assumption relaxed from Section 22.3 on, and $x_{it}$ is assumed to include only
current-period variables, an assumption relaxed in Section 22.5. Observations are assumed to
be independent over i and a short panel with T fixed and $N \to \infty$ is assumed.

Begin by stacking all T observations for the ith individual,

$$y_i = X_i \beta + u_i, \tag{22.2}$$

where $y_i$ and $u_i$ are $T \times 1$ vectors and $X_i$ is a $T \times K$ matrix with tth row $x_{it}'$, so

$$y_i = \begin{bmatrix} y_{i1} \\ \vdots \\ y_{iT} \end{bmatrix}, \quad X_i = \begin{bmatrix} x_{i1}' \\ \vdots \\ x_{iT}' \end{bmatrix}, \quad u_i = \begin{bmatrix} u_{i1} \\ \vdots \\ u_{iT} \end{bmatrix}.$$
The model (22.2) defines a linear system of equations, so the results of Section 6.9.5
for systems IV estimation with data independent over i are directly applicable.

Assume the existence of a $T \times r$ matrix of instruments $Z_i$, where $r \geq K$ is the
number of instruments, that satisfy the r moment conditions

$$E[Z_i' u_i] = 0. \tag{22.3}$$

The GMM estimator based on these moment conditions minimizes the associated
quadratic form

$$Q_N(\beta) = \left[ \sum_{i=1}^{N} Z_i' u_i \right]' W_N \left[ \sum_{i=1}^{N} Z_i' u_i \right],$$

where $W_N$ denotes an $r \times r$ weighting matrix. Given $u_i = y_i - X_i \beta$, some algebra
yields the panel GMM estimator

$$\hat\beta_{PGMM} = \left[ \left( \sum_{i=1}^{N} X_i' Z_i \right) W_N \left( \sum_{i=1}^{N} Z_i' X_i \right) \right]^{-1} \left( \sum_{i=1}^{N} X_i' Z_i \right) W_N \left( \sum_{i=1}^{N} Z_i' y_i \right).$$


The essential condition for consistency of this estimator is assumption (22.3). 

In many applications $Z_i$ is composed of current and lagged values of exogenous
regressors. For example, suppose all regressors are contemporaneously exogenous.
Then $E[x_{it} u_{it}] = 0$ implies (22.3) with $Z_i = [x_{i1} \cdots x_{iT}]'$. In this case the model is
just identified and, since $Z_i = X_i$, $\hat\beta_{PGMM}$ simplifies to the pooled OLS estimator of
Chapter 21. If it is additionally assumed that $E[x_{i,t-1} u_{it}] = 0$, then $x_{i,t-1}$ is available as
additional instruments for the itth observation, the model is over-identified, and more
efficient estimation is possible using the PGMM estimator.

The use of various exogeneity assumptions to form the instrument matrix Z; is 
detailed in Section 22.2.4. The analysis requires adaptation in panel data models with 
individual-specific effects $\alpha_i$. This is illustrated in an empirical application in
Section 22.3 and is dealt with explicitly in Sections 22.4 and 22.5.


22.2.2. Panel-Robust Statistical Inference 


To express the distribution of the panel GMM estimator it is convenient to use more
compact notation. Rewrite

$$\hat\beta_{PGMM} = [X'Z W_N Z'X]^{-1} X'Z W_N Z'y, \tag{22.4}$$

where $X' = [X_1' \cdots X_N']$, $Z' = [Z_1' \cdots Z_N']$, and $y' = [y_1' \cdots y_N']$. Then $\hat\beta_{PGMM}$ is
asymptotically normal with estimated asymptotic variance matrix

$$\hat V[\hat\beta_{PGMM}] = [X'Z W_N Z'X]^{-1} X'Z W_N (N\hat S) W_N Z'X [X'Z W_N Z'X]^{-1}, \tag{22.5}$$

see Equation (6.97), where $\hat S$ is a consistent estimate of the $r \times r$ matrix

$$S = \mathrm{plim}\, \frac{1}{N} \sum_{i=1}^{N} Z_i' u_i u_i' Z_i, \tag{22.6}$$
and independence over i has been assumed. The essential assumption for this result is
that $N^{-1/2} Z'u = N^{-1/2} \sum_{i=1}^{N} Z_i' u_i \overset{d}{\to} \mathcal{N}[0, S]$. A White-type robust estimate of $S$ is

$$\hat S = \frac{1}{N} \sum_{i=1}^{N} Z_i' \hat u_i \hat u_i' Z_i, \tag{22.7}$$

where the $T \times 1$ estimated residual $\hat u_i = y_i - X_i \hat\beta$.

The estimate (22.5) yields panel-robust standard errors allowing for both het- 
eroskedasticity and correlation over time. Alternatively, the panel bootstrap could be 
used. For further discussion see Section 21.2.3 where the same issues apply. 


22.2.3. One-Step and Two-Step Panel GMM 


Different full-rank weighting matrices $W_N$ in (22.4) lead to different systems GMM
estimators, except in the just-identified case of $r = K$ when the PGMM estimator
simplifies to the IV estimator $[Z'X]^{-1} Z'y$ for any $W_N$. The discussion mirrors that in
Section 6.4.2. The two leading choices of $W_N$ are given here.


One-Step GMM 


The one-step GMM or two-stage least-squares estimator uses weighting matrix
$W_N = [\sum_{i=1}^{N} Z_i' Z_i]^{-1} = [Z'Z]^{-1}$, leading to

$$\hat\beta_{2SLS} = [X'Z (Z'Z)^{-1} Z'X]^{-1} X'Z (Z'Z)^{-1} Z'y. \tag{22.8}$$

The motivation for this estimator is that it can be shown to be the optimal PGMM
estimator based on (22.3) if $u_i | Z_i$ is iid $[0, \sigma^2 I_T]$.

This estimator is called one-step GMM because given the data it can be directly
calculated using Equation (22.8). It is called 2SLS as it can instead be obtained in
two stages by (1) OLS of $X_i$ on $Z_i$, yielding prediction $\hat X_i$, and (2) OLS of $y_i$ on $\hat X_i$.

An estimate of the variance matrix of $\hat\beta_{2SLS}$ that is both panel and heteroskedasticity
robust is that given in (22.5) with $W_N = [Z'Z]^{-1}$.


Two-Step GMM 


The most efficient GMM estimator based on the unconditional moment condition
(22.3) uses weighting matrix $W_N = \hat S^{-1}$, where $\hat S$ is consistent for $S$ defined in (22.6);
see Section 6.4.2 for the general result. Using $\hat S$ in (22.7) yields the two-step GMM
estimator

$$\hat\beta_{2SGMM} = [X'Z \hat S^{-1} Z'X]^{-1} X'Z \hat S^{-1} Z'y. \tag{22.9}$$

Then (22.5) simplifies and $\hat V[\hat\beta_{2SGMM}] = [X'Z (N\hat S)^{-1} Z'X]^{-1}$.
This is called two-step GMM since a first-step consistent estimator of $\beta$ such as
$\hat\beta_{2SLS}$ is needed to form the residuals $\hat u_i$ used to compute $\hat S$.
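
The one-step and two-step estimators and their panel-robust variance estimates can be sketched compactly with arrays indexed by individual and period. The code below is an illustration under stated assumptions, not a reference implementation: the function name `panel_gmm`, the (N, T, ·) array layout, and the simulated data with one endogenous regressor and two excluded instruments are all choices made for this example.

```python
import numpy as np

def panel_gmm(y, X, Z):
    """One-step (2SLS) and two-step panel GMM for a balanced panel.

    y: (N, T); X: (N, T, K) regressors; Z: (N, T, r) instruments with r >= K.
    """
    N = y.shape[0]
    ZX = np.einsum('itr,itk->rk', Z, X)      # Z'X = sum_i Z_i'X_i
    Zy = np.einsum('itr,it->r', Z, y)        # Z'y = sum_i Z_i'y_i
    ZZ = np.einsum('itr,its->rs', Z, Z)      # Z'Z = sum_i Z_i'Z_i

    def estimate(W):                         # (22.4) for a given weighting matrix
        return np.linalg.solve(ZX.T @ W @ ZX, ZX.T @ W @ Zy)

    def s_hat(b):                            # (22.7): rows of g are (Z_i'u_i)'
        g = np.einsum('itr,it->ir', Z, y - X @ b)
        return g.T @ g / N

    def robust_var(W, S):                    # (22.5)
        A_inv = np.linalg.inv(ZX.T @ W @ ZX)
        return A_inv @ ZX.T @ W @ (N * S) @ W @ ZX @ A_inv

    W1 = np.linalg.inv(ZZ)
    b_2sls = estimate(W1)                    # (22.8)
    S1 = s_hat(b_2sls)                       # S-hat from first-step residuals
    W2 = np.linalg.inv(S1)
    b_2sgmm = estimate(W2)                   # (22.9)
    return {"b_2sls": b_2sls, "V_2sls": robust_var(W1, S1),
            "b_2sgmm": b_2sgmm, "V_2sgmm": robust_var(W2, S1), "S": S1}

# Simulated data: x is endogenous because u depends on v; true slope is 1.
rng = np.random.default_rng(3)
N, T = 2000, 3
z = rng.normal(size=(N, T, 2))
v = rng.normal(size=(N, T))
x = z[..., 0] + z[..., 1] + v
u = 0.8 * v + rng.normal(size=(N, T))
y = 1.0 * x + u
ones = np.ones((N, T, 1))
X = np.concatenate([ones, x[..., None]], axis=2)   # K = 2
Z = np.concatenate([ones, z], axis=2)              # r = 3, over-identified
res = panel_gmm(y, X, Z)
res_just = panel_gmm(y, X, Z[:, :, :2])            # r = K, just-identified
```

In the just-identified call the two estimates coincide, since the estimator then does not depend on the weighting matrix.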




Efficiency Gains 


In this chapter the focus is on situations where $Z$ cannot contain all of $X$, because
of endogeneity of some components of $X$. Then panel GMM provides consistent
estimates when OLS does not. Two-step GMM provides the most efficient estimator based
on the moment condition $E[Z_i' u_i] = 0$.

Even if regressors are strongly exogenous, two-step GMM has the attraction of being
more efficient than pooled OLS. To see this, suppose that $X$ is strongly exogenous.
Setting $Z = X$, the two-step GMM estimator simplifies to $[X'X]^{-1} X'y$ and there is no
benefit to panel GMM. However, if instead $Z$ equals $X$ as well as some additional
variables, such as powers of the regressors or regressor values in periods other than
the current period, then the two-step GMM method is at least as efficient as OLS, with
equality applying if the errors $u_{it}$ are iid.

Even more efficient estimators than $\hat\beta_{2SGMM}$ are possible, by widening the definition
of $Z_i$, by using the optimal moment condition based on $E[u_i | Z_i] = 0$, which need not
be $E[Z_i' u_i] = 0$ (see Section 22.4.3), and by using additional moment restrictions. We
shy away from calling two-step GMM the optimum GMM estimator, as in Section
6.3, because it is only optimal given (22.3).


Tests of Overidentifying Restrictions 


If there are r instruments and only K parameters to estimate, then panel GMM esti- 
mations leaves (r — K) overidentifying restrictions. From Section 6.3.8 this permits a 
test of overidentifying restrictions 


N N 
OIR = BA (NS)! È za f (22.10) 
i=1 


i=l 


where Ñ; = y; — Z; TET Sis given in (22.7), and independence over i is assumed 
but heteroskedasticity and correlation over t for given i is permitted. Note that Bəsomm 
must be used, not Bosis- 

This test statistic is distributed as $\chi^2(r - K)$ under the null hypothesis that the
overidentifying restrictions are valid. If OIR is large then the overidentifying moment
conditions are rejected and we conclude that some of the instruments in $Z_i$ are
correlated with the error and hence are endogenous.
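
As a rough illustration of how these quantities fit together, the following NumPy sketch computes panel 2SLS, two-step GMM, and the OIR statistic. The function name `panel_gmm_oir`, the array layout, and the simulated data are assumptions made for this example; the weighting matrix is taken to be the usual panel-robust $\hat{S} = N^{-1}\sum_i Z_i'\hat{u}_i\hat{u}_i'Z_i$ built from first-step 2SLS residuals, which is what (22.7) is assumed to define.

```python
import numpy as np

def panel_gmm_oir(y, X, Z):
    """Panel 2SLS, two-step GMM, and an OIR statistic in the form of (22.10).

    y: (N, T) outcomes, X: (N, T, K) regressors, Z: (N, T, r) instruments.
    """
    N, T, K = X.shape
    r = Z.shape[2]
    ZX = np.einsum("itr,itk->rk", Z, X)   # sum_i Z_i' X_i
    Zy = np.einsum("itr,it->r", Z, y)     # sum_i Z_i' y_i
    ZZ = np.einsum("itr,its->rs", Z, Z)   # sum_i Z_i' Z_i

    def gmm(W):
        # Linear GMM estimator [X'Z W Z'X]^{-1} X'Z W Z'y
        return np.linalg.solve(ZX.T @ W @ ZX, ZX.T @ W @ Zy)

    b_2sls = gmm(np.linalg.inv(ZZ))
    u1 = y - X @ b_2sls                   # first-step residuals, shape (N, T)
    m = np.einsum("itr,it->ir", Z, u1)    # row i holds Z_i' u_i
    S = m.T @ m / N                       # panel-robust S-hat (assumed form)
    b_2sgmm = gmm(np.linalg.inv(S))
    u2 = y - X @ b_2sgmm                  # residuals at the two-step estimate
    g = np.einsum("itr,it->r", Z, u2)     # sum_i Z_i' u_i
    oir = g @ np.linalg.solve(N * S, g)
    return b_2sls, b_2sgmm, oir, r - K

# Simulated data (illustrative only): one endogenous regressor, three valid instruments.
rng = np.random.default_rng(0)
N, T = 500, 4
Z = rng.normal(size=(N, T, 3))
v = rng.normal(size=(N, T))
u = 0.5 * v + rng.normal(size=(N, T))     # error is correlated with x through v
x = Z.sum(axis=2) + v
y = 2.0 * x + u
b_2sls, b_2sgmm, oir, dof = panel_gmm_oir(y, x[:, :, None], Z)
```

The returned `oir` would be compared with a $\chi^2$ critical value with `dof` degrees of freedom.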


22.2.4. Selection of Instruments 


The discussion so far has assumed the existence of a $T \times r$ matrix of instruments $Z_i$
that satisfies (22.3). Now we provide a lengthy discussion of how to obtain instruments
in a panel setting.

In cross-section models, endogenous variables are instrumented by variables that 
do not appear as regressors in the equation of interest. Such variables can also be used 
as instruments in the panel case. With panel models, however, the additional periods of 
data provide additional moment conditions and additional instruments that can easily 
lead to identification or overidentification of $\beta$.

The number of moment conditions and instruments available expands as progressively
stronger assumptions are made about the correlation between $u_{it}$ and $z_{is}$,
$s, t = 1, \ldots, T$. We consider the effect of progressively stronger exogeneity
assumptions (see Section 2.3), following M.-J. Lee (2002). The emphasis is on using
exogenous components of the regressors as instruments more than once, but the technique
also applies to more traditional instruments that are variables excluded from the
regression (22.1).


Summation Assumption 


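An obvious procedure is to define $Z_i$ similarly to $X_i$. Then

$$Z_i = \begin{bmatrix} z_{i1}' \\ z_{i2}' \\ \vdots \\ z_{iT}' \end{bmatrix}, \qquad u_i = \begin{bmatrix} u_{i1} \\ u_{i2} \\ \vdots \\ u_{iT} \end{bmatrix}, \tag{22.11}$$

where $z_{it}$ is $r \times 1$ and $E[Z_i'u_i] = 0$ if the summation assumption

$$E\left[\sum_{t=1}^{T} z_{it}u_{it}\right] = 0 \tag{22.12}$$

is satisfied.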

This assumption corresponds to that used in pooled OLS regression of $y_{it}$ on $x_{it}$,
since if $z_{it} = x_{it}$ in (22.12) then the PGMM estimator defined in (22.4) simplifies to
$[\sum_i X_i'X_i]^{-1}\sum_i X_i'y_i$.

For this estimator to be feasible requires at least that the order condition be met, so
that $r \geq K$. Under the summation assumption it is just as difficult to find instruments
with panel data as it is with cross-section data.


Contemporaneous Exogeneity Assumption 


A stronger and more natural assumption is the contemporaneous exogeneity assump- 
tion that 


$$E[z_{it}u_{it}] = 0, \quad t = 1, \ldots, T, \tag{22.13}$$


so that the instruments are assumed to be contemporaneously uncorrelated with the 
error term. 

This presents many more moment conditions, as in principle there are as many as Tr
moment conditions, where $r = \dim[z_{it}]$. To use these we define

$$Z_i = \begin{bmatrix} z_{i1}' & 0 & \cdots & 0 \\ 0 & z_{i2}' & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & z_{iT}' \end{bmatrix}, \qquad u_i = \begin{bmatrix} u_{i1} \\ u_{i2} \\ \vdots \\ u_{iT} \end{bmatrix}, \tag{22.14}$$

where $Z_i$ is now $T \times Tr$. The moment condition (22.3) holds, since $E[Z_i'u_i] = 0$ by
(22.13), but now (22.3) defines Tr moment conditions that can be used to estimate the
K components of $\beta$.

This remarkable result of an apparent surfeit of moment restrictions comes about
because of the implicit assumption that $\beta$ is time-invariant, so that each additional time
period offers additional moment restrictions.

The number of additional moment restrictions is reduced to the extent that $\beta$ is time
varying. In particular, the intercept is often permitted to vary over time by inclusion in
$x_{it}$ of (T − 1) time dummies $d_{s,it} = 1$ if t = s and 0 otherwise, for s = 2, ..., T. Then
the condition $E[d_{s,it}u_{it}] = 0$ cannot be used as it duplicates the condition
$E[1 \times u_{it}] = 0$ implied by inclusion of an intercept in $x_{it}$. In the preceding example, if $x_{it}$ includes
time dummies then there are only TK − (T − 1) moment conditions available. Any
time-invariant regressors can be used only once as an instrument.


Weak Exogeneity Assumption 


Moment condition (22.13) considers only contemporaneous correlation between
instruments and the error. A stronger assumption is the weak exogeneity assumption
or predetermined instruments assumption that, additionally, lagged values of the
instruments are uncorrelated with the current-period error, so that

$$E[z_{is}u_{it}] = 0, \quad s \leq t, \quad t = 1, \ldots, T. \tag{22.15}$$

Condition (22.15) permits $z_{i1}, \ldots, z_{it}$ to be instruments for $u_{it}$, though future values
of $z_{is}$ cannot be so used. The instrument matrix $Z_i$ is structured similarly to (22.14), except
that $z_{it}'$ is replaced by the expanded instrument vector $[z_{i1}', \ldots, z_{it}']$ that increases in
size as t increases.

Conditions of this sort arise in rational expectations models and in models of
intertemporal decision making under uncertainty that lead to Euler conditions of the
form $E[u_{it}|\mathcal{I}_{it}] = 0$, where $\mathcal{I}_{it}$ is the information set available at time t and an
example of $u_{it}$ is given in Section 6.2.7. If the information set includes current and past
values of $z_{it}$ then $E[u_{it}|z_{is}] = 0$, $s \leq t$, leading to (22.15).

More generally these conditions become relevant in dynamic models with lagged
dependent variables as regressors (see Section 22.5). In some instances contemporaneous
correlation is not ruled out, so that the inequality $s \leq t$ in (22.15) is replaced by
$s < t$.

Note that time-invariant instruments can only be used once. Thus if $z_{it} = [z_{1i}' \; z_{2it}']'$,
then $z_{1i}$ and $z_{2i1}, \ldots, z_{2it}$ are available as instruments.


Strong Exogeneity Assumption 


A stronger assumption than weak exogeneity is the strong exogeneity assumption 
that future values of instruments are also uncorrelated with the current period error, so 
that 


$$E[z_{is}u_{it}] = 0, \quad s, t = 1, \ldots, T. \tag{22.16}$$

Then current, past, and future values of $z_{is}$ are valid instruments for $u_{it}$.

This assumption was maintained for the regressors $x_{it}$ throughout Chapter 21, since
$E[u_{it}|x_{i1}, \ldots, x_{iT}] = 0$ implies $E[u_{it}|x_{is}] = 0$, $1 \leq s \leq T$, and hence $E[x_{is}u_{it}] = 0$. It
may be appropriate for static models, but for dynamic models at most weak exogeneity
of instruments can be assumed.

Condition (22.16) permits $z_{i1}, \ldots, z_{iT}$ to be instruments for $u_{it}$. The instrument matrix $Z_i$
is structured similarly to (22.14), except that $z_{it}'$ in (22.14) is replaced by the expanded
instrument vector $[z_{i1}', \ldots, z_{iT}']$.

As for the weak exogeneity case, time-invariant instruments can be used only once.
If $z_{it} = [z_{1i}' \; z_{2it}']'$ then $T(r_{TI} + Tr_{TV})$ moment conditions are available, where $r_{TI}$ and
$r_{TV}$ denote the numbers of time-invariant and time-varying instruments.

The extraordinary number of moment conditions, as many as $rT^2$, is due to exclusion
restrictions implicitly made in the panel model (22.1). For simplicity suppose all
components of $x_{it}$ are strongly exogenous and we wish to use these as instruments
whenever possible. In general $y_{it}$ could depend on the regressors in all time periods,
$x_{i1}, \ldots, x_{iT}$. In contrast, the panel model $y_{it} = x_{it}'\beta + u_{it}$ with $E[x_{it}u_{it}] = 0$ excludes
all but $x_{it}$ from the model for $y_{it}$. The strong exogeneity assumption that $E[x_{is}u_{it}] = 0$
then permits the excluded regressors $x_{is}$, $s \neq t$, to be used as instruments in addition
to $x_{it}$.


Redundant Instruments 


If $z_{it}$ varies over both i and t then its lags and leads can also be used as
instruments, depending on the exogeneity assumptions made. For the (i, t)th observation
the available instruments are $z_{it}$ under contemporaneous exogeneity, $z_{i1}, \ldots, z_{it}$ under
weak exogeneity, and $z_{i1}, \ldots, z_{iT}$ under strong exogeneity. This makes identification
possible using only exogenous regressors as instruments. Only under the summation 
assumption are the difficulties of finding valid instruments comparable to those in the 
cross-section case. 

In practice, however, there are not as many available instruments as the preceding
discussion suggests. Time-invariant instruments $z_{it} = z_i$ can be used only once,
since then $z_{it} = z_{is}$ for all s and t. For example, this is the case for an intercept or
for a race or gender indicator. If the instrument is a regressor and lagged values of 
the regressor appear in the model then the number of available instruments is reduced. 
Time-varying instruments that vary in a systematic way may also not be available in all 
periods. Thus instruments that are the product of time dummies and a time-invariant 
regressor should be included only once if a complete set of time dummies is used. 
Examples include time dummies and time dummies interacted with race or gender in- 
dicators. Instruments that are a linear function of time should be used only once. For 
example, if year is an instrument then lagged years should not also be used. This com- 
ment does not apply to age, which increases linearly for each individual but varies 
across individuals. 

It is clearly easy to inadvertently use redundant instruments. The panel GMM 
estimators are still feasible and the usual results are valid if there are still sufficient 
nonredundant instruments. For example, if r instruments are used and two of these 
are redundant the model is still estimable provided $r \geq K + 2$, as $Z'X$ is still of full
rank K. Singularity problems in GMM estimation may arise if too many redundant
instruments are used, leading to an underidentified model. Even if the model remains 
overidentified, the degrees of freedom in a test of overidentifying restrictions will be 
reduced if some instruments are redundant. 


Weak Instruments 


Weak instruments, not to be confused with weak exogeneity, were introduced in Section
4.9. There is no well-established formal test of weak instruments. Standard $R^2$
and F-statistic diagnostics are given in Section 4.9. It is the incremental explanatory
power of the instruments that matters. So a partial $R^2$ that controls for exogenous
regressors that are also in the instrument set should be used. Moreover, whereas the
endogenous regressor is regressed on all instruments, the F-statistic should be one of the
overall significance of the subset of the instruments that are not exogenous regressors.

Since the errors here are not iid, the F-statistic should be based on panel-robust standard
errors. It can be calculated as $W/r^*$, where W is the Wald chi-square test statistic
for exclusion restrictions given in Section 7.2.7 and $r^*$ is the number of instruments
that are not regressors in the original model.
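
The calculation $F = W/r^*$ can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `first_stage_F`, the array layout, and the simulated data are hypothetical, and the panel-robust variance is the basic cluster-robust sandwich (clusters are individuals) without any finite-sample correction.

```python
import numpy as np

def first_stage_F(x, Z, excluded):
    """Panel-robust F = W / r* for instruments that are not regressors.

    x: (N, T) endogenous regressor, Z: (N, T, r) all instruments,
    excluded: column indices of Z that are not regressors in the original model.
    """
    N, T, r = Z.shape
    Zs = Z.reshape(N * T, r)
    xs = x.reshape(N * T)
    ZZ_inv = np.linalg.inv(Zs.T @ Zs)
    pi = ZZ_inv @ Zs.T @ xs                      # first-stage OLS coefficients
    e = (xs - Zs @ pi).reshape(N, T)             # first-stage residuals
    scores = np.einsum("itr,it->ir", Z, e)       # row i holds Z_i' e_i
    V = ZZ_inv @ (scores.T @ scores) @ ZZ_inv    # cluster-robust variance of pi
    idx = np.asarray(excluded)
    W = pi[idx] @ np.linalg.solve(V[np.ix_(idx, idx)], pi[idx])  # Wald statistic
    return W / idx.size

# Simulated data (illustrative only).
rng = np.random.default_rng(1)
N, T = 300, 4
const = np.ones((N, T, 1))
z = rng.normal(size=(N, T, 2))
Z = np.concatenate([const, z], axis=2)           # column 0 plays the role of an exogenous regressor
a = rng.normal(size=(N, 1))                      # individual effect, induces within-i correlation
x_strong = 1.0 + z[:, :, 0] + 0.5 * z[:, :, 1] + a + rng.normal(size=(N, T))
x_weak = 1.0 + a + rng.normal(size=(N, T))       # instruments are irrelevant for this regressor
F_strong = first_stage_F(x_strong, Z, excluded=[1, 2])
F_weak = first_stage_F(x_weak, Z, excluded=[1, 2])
```

In this simulated design `F_strong` should be far larger than `F_weak`, which mirrors the contrast a practitioner looks for in the first-stage diagnostics.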


22.2.5. Computation of Panel GMM Estimators 


The moment conditions discussed in the preceding section provide the instrument matrix
$Z_i$. Then, given $Z_i$, one can estimate $\beta$ by $\hat{\beta}_{\mathrm{2SLS}}$ defined in (22.8) or by $\hat{\beta}_{\mathrm{2SGMM}}$
defined in (22.9).

The 2SLS estimator is easier to implement than the two-step GMM estimator. Consider
estimation under the summation assumption, in which case $Z_i$ is defined in (22.11). Then
$\hat{\beta}_{\mathrm{2SLS}}$ is given in (22.8), where $Z'X = \sum_i Z_i'X_i = \sum_i\sum_t z_{it}x_{it}'$, and similar algebra
applies for the other cross-products. This yields the standard textbook formula for 2SLS,
except that summation is over both i and t. Thus $\hat{\beta}_{\mathrm{2SLS}}$ can be obtained by 2SLS
regression of $y_{it}$ on $x_{it}$ using a cross-section 2SLS package. Panel-robust standard
errors can then be obtained using a cluster-robust option that permits clustering on i, 
or by a panel bootstrap that resamples over i rather than both i and t. The approaches 
are similar to those for pooled LS given in Section 21.2.3, which provides additional 
detail. 

For assumptions other than the summation assumption one can still use a cross-section
2SLS package by appropriately defining the instrument matrix $Z_i$, which then
has a more complicated form. For the contemporaneous exogeneity assumption, $Z_i$ is
defined in (22.14). This is in the same form as (22.11) if the tth row in (22.11), $z_{it}'$, is
replaced by

$$[0_{r_1}' \cdots 0_{r_{t-1}}' \;\; z_{it}' \;\; 0_{r_{t+1}}' \cdots 0_{r_T}'], \tag{22.17}$$

where $r_s = \dim[z_{is}]$ and $0_{r_s}$ denotes an $r_s \times 1$ vector of zeros. Similarly, for the weak
exogeneity assumption, $Z_i$ is as in (22.11) with the tth row in (22.11), $z_{it}'$, replaced by

$$[0_{r_1}' \cdots 0_{r_{t-1}}' \;\; (z_i^t)' \;\; 0_{r_{t+1}}' \cdots 0_{r_T}'], \tag{22.18}$$

where $(z_i^t)' = [z_{i1}' \ldots z_{it}']$ and $r_s = \dim[z_i^s]$, and for the strong exogeneity assumption,
$Z_i$ is as in (22.11) with the tth row in (22.11), $z_{it}'$, replaced by

$$[0_{r_1}' \cdots 0_{r_{t-1}}' \;\; (z_i^T)' \;\; 0_{r_{t+1}}' \cdots 0_{r_T}'], \tag{22.19}$$

where $(z_i^T)' = [z_{i1}' \ldots z_{iT}']$ and $r_s = \dim[z_i^T]$. A practical example of generating the
instruments is given in Section 22.3.
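
The row structures in (22.17) to (22.19) can be generated mechanically. The sketch below builds $Z_i$ for one individual under each of the four assumptions; the function name `instrument_matrix` and the string labels are hypothetical choices for this illustration, and time-invariant or otherwise redundant instruments are not handled.

```python
import numpy as np

def instrument_matrix(z, assumption):
    """Build Z_i (T rows) from z, a (T, r) array whose row t is z_it'."""
    T, r = z.shape
    if assumption == "summation":
        return z.copy()                           # (22.11): T x r
    blocks = []
    for t in range(T):
        if assumption == "contemporaneous":
            blocks.append(z[t])                   # z_it only, as in (22.17)
        elif assumption == "weak":
            blocks.append(z[: t + 1].ravel())     # z_i1, ..., z_it, as in (22.18)
        elif assumption == "strong":
            blocks.append(z.ravel())              # z_i1, ..., z_iT, as in (22.19)
        else:
            raise ValueError(f"unknown assumption: {assumption}")
    width = sum(b.size for b in blocks)
    Z = np.zeros((T, width))
    col = 0
    for t, b in enumerate(blocks):
        Z[t, col: col + b.size] = b               # row t is zero outside its own block
        col += b.size
    return Z
```

With T periods and r instruments this gives Tr columns under contemporaneous exogeneity, rT(T + 1)/2 under weak exogeneity, and $rT^2$ under strong exogeneity.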

In practice there can be too many moment conditions. For example, with 10 periods
of data and 5 time-varying regressors the strong exogeneity assumption yields
as many as $5 \times 10^2 = 500$ moment conditions (and the preceding row vector has 500
entries) with only 5 parameters to estimate. The marginal value of an instrument may
be very slight, because of increasing multicollinearity among the instruments, leading 
to a situation of weak instruments. Good practice is to treat time-varying instruments 
that vary little over time as time-invariant. For example, use only the data for the first 
period as an instrument. Even instruments that vary considerably over time might be 
used for only a few periods rather than in all possible periods. 

Computation of the more efficient $\hat{\beta}_{\mathrm{2SGMM}}$ is not possible using only a 2SLS package.
Instead, either more specialized software is needed or the estimator needs to be
programmed in a matrix language.

Table 22.1 provides a summary of the four exogeneity assumptions and the resulting 
valid instruments. 


22.2.6. Variations on GMM Estimation 


Although $\hat{\beta}_{\mathrm{2SGMM}}$ is more efficient than $\hat{\beta}_{\mathrm{2SLS}}$, several studies find it to have greater
finite-sample bias than $\hat{\beta}_{\mathrm{2SLS}}$, especially when r is much greater than K. For an
explanation see the discussion of finite-sample bias of optimal GMM in Section 6.3.5.

One approach is to be judicious in the use of instruments, though then potential 
efficiency gains due to additional instruments are lost. 

Several authors have proposed alternative GMM estimators that may be less likely 
to be biased in finite samples. Many of these are presented in Section 6.4.4 and are 
used in the panel study by Ziliak (1997). 


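Table 22.1. Panel Exogeneity Assumptions and Resulting Instruments


Exogeneity Assumption   Moment Condition                              Instrument Vector^a
Summation               $E[\sum_t z_{it}u_{it}] = 0$                  $[z_{it}']$
Contemporaneous         $E[z_{it}u_{it}] = 0$, all t                  $[0_{r_1}' \cdots 0_{r_{t-1}}' \; z_{it}' \; 0_{r_{t+1}}' \cdots 0_{r_T}']$
Weak                    $E[z_{is}u_{it}] = 0$, $s \leq t$, all t      $[0_{r_1}' \cdots 0_{r_{t-1}}' \; (z_i^t)' \; 0_{r_{t+1}}' \cdots 0_{r_T}']$
Strong                  $E[z_{is}u_{it}] = 0$, all s and t            $[0_{r_1}' \cdots 0_{r_{t-1}}' \; (z_i^T)' \; 0_{r_{t+1}}' \cdots 0_{r_T}']$

^a The instrument vector is the tth row of $Z_i$ in (22.11); $(z_i^t)' = [z_{i1}' \ldots z_{it}']$, $(z_i^T)' = [z_{i1}' \ldots z_{iT}']$;
and $r_s = \dim[z_{is}]$ or $\dim[z_i^s]$ or $\dim[z_i^T]$.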


22.2.7. Chamberlain’s Optimal Distance Estimator


Consider estimation of the individual-specific effects model

$$y_{it} = \alpha_i + x_{it}'\beta + u_{it}, \tag{22.20}$$

when regressors are strongly exogenous as in Chapter 21. In Sections 21.2.3 and 21.6.1
methods to obtain panel-robust standard errors for the within estimator were presented.

If panel-robust inference is warranted, because the errors $u_{it}$ are not iid, then the estimators
detailed in Chapter 21 are actually inefficient. More efficient estimation is possible using
optimal GMM applied to an overidentified model. Here $x_{is}$, $s \neq t$, are available as
additional instruments and GMM can be applied to a transformed model if elimination
of $\alpha_i$ is necessary (see Section 22.4.2). The efficiency improvement is analogous to
that for cross-section data with heteroskedasticity (see Section 6.3.5). 

Chamberlain (1982, 1984) proposed the following more efficient estimator. The 
model (22.20) can be stacked to yield 


$$y_i = e\alpha_i + (I_T \otimes \beta')x_i + u_i, \tag{22.21}$$

where $e = (1, 1, \ldots, 1)'$ is a $T \times 1$ vector of ones, $x_i = [x_{i1}' \ldots x_{iT}']'$ is a $TK \times 1$ vector,
and $y_i$ and $u_i$ are $T \times 1$ vectors. Equation (22.21) makes clear the restrictions
that are implicitly made in static models that specify that $y_{it}$ depends only on
contemporaneous $x_{it}$. Chamberlain used linear projection arguments that rely on weaker
assumptions than those of conditional expectation. Let

$$E^*[\alpha_i|x_i] = \mu + \sum_{t=1}^{T} \lambda_t'x_{it} = \mu + \lambda'x_i,$$

where $E^*$ denotes linear projection. Given $E[u_i|\alpha_i, x_i] = 0$, (22.21) implies

$$E^*[y_i|x_i] = e\mu + (I_T \otimes \beta' + e\lambda')x_i.$$

This imposes restrictions on the unrestricted linear projection $E^*[y_i|x_i] = \pi_0 + \pi x_i$,
specifically that $\pi - I_T \otimes \beta' - e\lambda' = 0$.

Rather than use GMM, Chamberlain proposed the following two-step procedure.
First, obtain $\hat{\pi}$ by multivariate OLS regression of $y_i$ on intercepts and $x_i$. Second,
obtain the optimal MD estimator (see Section 6.7) that minimizes

$$Q_N(\beta, \lambda) = (\mathrm{Vec}[\hat{\pi} - I_T \otimes \beta' - e\lambda'])' \, W_N \, (\mathrm{Vec}[\hat{\pi} - I_T \otimes \beta' - e\lambda']),$$

where the optimal weighting matrix $W_N = (\hat{V}[\mathrm{Vec}[\hat{\pi}]])^{-1}$. This yields an estimator $\hat{\beta}$
that is more efficient than OLS estimation of (22.20) if $u_{it}$ is heteroskedastic.

Minimum distance estimation has been supplanted by GMM; see Arellano (2003,
pp. 22-23) and Crépon and Mairesse (1995) for comparison of Chamberlain’s MD 
estimator with GMM. However, Chamberlain’s approach of obtaining moment restric- 
tions via exogeneity assumptions and assumptions on the individual effects has had 
a big impact on the panel literature. His MD estimator is also used for estimation of 
covariance structures (see Section 22.5.4). 


22.3. Panel GMM Example: Hours and Wages


We return to the hours-wages example of Section 21.3. Unlike in Chapter 21, regressors
are now permitted to be endogenous, and unlike in Section 22.2, an individual-specific
fixed effect is included. Estimation is by the IV methods of Section 22.2, after
first-differencing to eliminate the fixed effects.

The regression model is 


$$\mathrm{lnhrs}_{it} = \alpha_i + \beta_1\mathrm{lnwg}_{it} + \beta_2\mathrm{kids}_{it} + \beta_3\mathrm{age}_{it} + \beta_4\mathrm{agesq}_{it} + \beta_5\mathrm{disab}_{it} + u_{it},$$

where interest lies in the intertemporal substitution wage elasticity of labor supply, $\beta_1$,
the coefficient of lnwg, and the additional regressors are number of children, age, age
squared, and an indicator for disability.

MaCurdy (1981) derived this relationship using a life-cycle labor supply model under
uncertainty. The model is then a “$\lambda$-constant” model where $\alpha_i$ here equals $\lambda_i$, a
multiple of the marginal utility of initial wealth that is time-invariant but will differ
across individuals. Since $\lambda_i$ depends on variables and constraints it needs to be treated
as a fixed rather than random effect. The labor supply literature presents several
methods for controlling for this fixed effect.

One method, discussed further in Section 22.4.2, is to first difference the regression 
model, yielding 


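$$\Delta\mathrm{lnhrs}_{it} = \beta_1\Delta\mathrm{lnwg}_{it} + \beta_2\Delta\mathrm{kids}_{it} + \beta_3\Delta\mathrm{age}_{it} + \beta_4\Delta\mathrm{agesq}_{it} + \beta_5\Delta\mathrm{disab}_{it} + \Delta u_{it}. \tag{22.22}$$


Estimation by OLS is then consistent for $\beta$ if all regressors are exogenous. Note that
this differencing induces serial correlation in the error even if $u_{it}$ are iid, so panel-
robust standard errors should be used.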

Ziliak (1997) instead permitted $\mathrm{lnwg}_{it}$ to be contemporaneously correlated with
$u_{it}$, because of measurement error in wage or because of kink points in the budget
constraint. Then the OLS estimator of (22.22) is inconsistent.

Ziliak proposed IV estimation using suitably lagged regressors as instruments. Assume
that past wages are uncorrelated with the error, so that lnwg is weakly exogenous
aside from being contemporaneously correlated with the error. Then $E[\mathrm{lnwg}_{is}u_{it}] = 0$
for $s \leq t - 1$ implies that for the differenced model error $E[\mathrm{lnwg}_{is}\Delta u_{it}] = 0$ for
$s \leq t - 2$, since $\Delta u_{it} = u_{it} - u_{i,t-1}$ also involves $u_{i,t-1}$. So lnwg lagged two or more
periods may be used as an instrument in the first-differences model. Note that this means
that at least three periods of the original data are needed to identify $\beta$.
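
The lag-selection argument can be checked by simulation. In the sketch below the data-generating process is an assumption chosen purely for illustration (it is not Ziliak's data or model): a variable `w`, standing in for lnwg, depends on current and past errors but never on future ones, so it is contemporaneously endogenous yet predetermined.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 20000, 5
u = rng.normal(size=(N, T))                   # model error u_it
eta = rng.normal(size=(N, T))
w = np.zeros((N, T))                          # stands in for lnwg_it
w[:, 0] = u[:, 0] + eta[:, 0]
for t in range(1, T):
    # w_it depends on current and past errors only, never on future errors
    w[:, t] = 0.5 * w[:, t - 1] + u[:, t] + eta[:, t]

du = u[:, 1:] - u[:, :-1]                     # column k holds Delta u for period index k + 1
t = T - 1                                     # last period (0-based index)
m_lag1 = np.mean(w[:, t - 1] * du[:, t - 1])  # sample analog of E[w_{i,t-1} * Delta u_it]
m_lag2 = np.mean(w[:, t - 2] * du[:, t - 1])  # sample analog of E[w_{i,t-2} * Delta u_it]
```

In this design the once-lagged level is correlated with the differenced error (`m_lag1` is close to −1), whereas the twice-lagged level is not (`m_lag2` is close to zero), which is why only lags of two or more periods qualify as instruments.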

Ziliak’s study focused on the properties of panel GMM estimators with endogenous
regressors, so he treated all the regressors in (22.22) as endogenous and used as
instruments lags of one or more periods in the levels of the other four regressors. For
simplicity an intercept and time dummies, individual-invariant instruments that can be
used only once, were not included. Results here change little with inclusion of an
intercept as the dependent variable is in differenced form. Since $\mathrm{lnwg}_{i,t-2}$ is always used
as an instrument the first two years are dropped and only the eight years 1981-1988
are used to estimate (22.22).


Table 22.2^a

                          Base Case             Stacked
               OLS      2SLS     2SGMM      2SLS     2SGMM

$\hat{\beta}_1$          0.112    0.209    0.547      0.543    0.330
Panel se       (.096)   (.374)   (.327)     (.209)   (.110)
Het se         [.079]   [.423]   [-]        [.226]   [-]
Default se     {.023}   {.389}   {-}        {.169}   {-}
RMSE           .283     .296     .307       .307     .298
Instruments    5        9        9          72       72
OIR Test       -        -        5.45       -        69.51
dof            -        -        4          -        67
p-value        -        -        .244       -        .393
N              4256     4256     4256       4256     4256

^a Differenced regression uses annual data from 1981-1988 for 532 men. Reported are $\hat{\beta}_1$, the coefficient of
Δlnwg, and three estimated standard errors: panel robust in parentheses, heteroskedastic robust in square brackets,
and usual default estimates that assume iid errors in curly braces. All regressions additionally include Δkids,
Δage, Δagesq, and Δdisab as regressors but their coefficient estimates are not reported. The instruments are
lnwg lagged twice and kids, age, agesq, and disab lagged both once and twice. For the base case there are 9
instruments and for stacked instruments there are 8 × 9 = 72 instruments. RMSE is the root mean square error
of the residual. OIR is the overidentifying restrictions test statistic, dof is the degrees of freedom, and p-value
is the p-value for this test.


Table 22.2 presents a small subset of the many results given in tables 1 and 2 of 
Ziliak (1997). For completeness various standard error estimates are given but the 
panel-robust standard errors should be used. 


OLS: The column OLS reports OLS estimation of (22.22). The labor supply elasticity 
of 0.112 differs a little from the estimate of 0.109 in the First-Diff column of Table 
21.2 as here the four demographic variables are also included as regressors and an 
additional year of data has been dropped. Because first differences are modeled the 
model fit is poor, and the $R^2$ with additional inclusion of an intercept is 0.006.


2SLS with Base-Case Instruments: The base-case instruments use $Z_i$ defined
in (22.11), where $z_{it}$ has nine entries: $\mathrm{lnwg}_{i,t-2}$, $\mathrm{kids}_{i,t-1}$, $\mathrm{age}_{i,t-1}$, $\mathrm{agesq}_{i,t-1}$,
$\mathrm{disab}_{i,t-1}$, $\mathrm{kids}_{i,t-2}$, $\mathrm{age}_{i,t-2}$, $\mathrm{agesq}_{i,t-2}$, and $\mathrm{disab}_{i,t-2}$. The model is then
overidentified with nine instruments and five parameters to estimate. The 2SLS estimate
of $\beta_1$ is much less precise than the OLS estimate, with standard error increasing
fourfold from 0.096 to 0.374. For the other regressors, not reported, the efficiency 
loss is much less. 


2SLS with Stacked Instruments: The base case is GMM based on the nine moment
conditions $E[\sum_t z_{it}u_{it}] = 0$. The stacked instruments instead use 72 (= 8 × 9)
moment conditions $E[z_{it}u_{it}] = 0$, t = 3, ..., 10, where $z_{it}$ is as in the base case.
Then use $Z_i$ defined in (22.14), where here $Z_i$ is 8 years by 72 instruments. The
tth row of $Z_i$ is given in (22.17), where $z_{it}$ here is the 9 × 1 column vector of
instruments for the base case. To construct the instruments first generate 72 variables
zsj equal to zero for all i and t, where s denotes the year and j denotes the jth
instrument. Then replace $zsj_{it}$ by $z_{j,it}$ if t = s but leave $zsj_{it} = 0$ if t ≠ s. For
example, for s = 3 (the third year) set z35 equal to $\mathrm{disab}_{i,2}$ when t = 3, if the fifth
instrument is $\mathrm{disab}_{i,t-1}$, and keep z35 equal to zero for t ≠ 3. The 2SLS estimates can then be
obtained by standard 2SLS regression of $\Delta\mathrm{lnhrs}_{it}$ on the five regressors in (22.22)
with these 72 constructed variables as instruments. Using the expanded instruments
we have that the standard error of the 2SLS estimate falls from 0.374 to 0.209 and
is only twice that of the original OLS estimate.


Two-step GMM: The two-step GMM estimates in Table 22.2 differ from those in 
table 1 of Ziliak (1997) as a panel-robust estimate of S defined in (22.7) is used here 
to form the weighting matrix, whereas Ziliak used the heteroskedastic-robust
$\hat{S} = N^{-1}\sum_i\sum_t \hat{u}_{it}^2 z_{it}z_{it}'$. As expected, the two-step GMM estimator is more efficient than
2SLS, with standard error falling from 0.374 to 0.327 with base-case instruments 
and from 0.209 to 0.110 with stacked instruments. This last standard error is not 
much larger than that for OLS. 


Test of Overidentifying Restrictions: The test statistic for overidentifying restrictions 
is given in (22.10). From Table 22.2 for both base case and stacked instruments the 
test statistic has p-value much higher than 0.05, so the restrictions are not rejected 
and we conclude that the overidentifying instruments are valid instruments. 


Test of Weak Instruments: Diagnostics for weak instruments were presented in Section
22.2.4 and Section 4.9. Since none of the regressors appear in the instrument
set the overall F-statistic from the first-stage regression is used rather than a
subset-of-regressors F-statistic. For the base-case instruments, regression of Δlnwg on
the nine instruments and a constant term yields panel-robust F = 2.80, and similar
regression for the 72 stacked instruments yields F = 1.90, indicating finite-sample
bias is very likely. Similar regressions for Δkids, Δage, Δagesq, and Δdisab, regressors
in (22.22) that are also being treated here as endogenous, yield F > 8.5
in all cases. Shea’s partial $R^2$ (see Section 4.9.1) is 0.0036 for Δlnwg and exceeds
0.075 for the other four endogenous regressors. The weak instruments problem is
therefore due to the difficulty of finding a good instrument for Δlnwg.


Efficiency Gains: In this example panel GMM estimators were used to control for
endogeneity. However, even if all the regressors are assumed to be strongly
exogenous, panel GMM is still attractive as it is more efficient than OLS unless the
errors $u_{it}$ are iid; see the discussion after (22.20). As an example, the panel two-step
GMM estimator with an instrument set comprising the base-case instruments plus the five
original regressors in (22.22) yields $\hat{\beta}_1 = 0.016$ with a standard error of 0.076, lower than
the OLS standard error of 0.096.
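
The stacked-instrument construction described above (72 variables, each equal to a base instrument in one year and zero in all other years) amounts to interacting the base instruments with year indicators. The sketch below assumes long-form data; the function name `stack_instruments` and the random placeholder values are hypothetical and are not the data used in the example.

```python
import numpy as np

def stack_instruments(z, year):
    """Interact base instruments with year indicators.

    z: (n_obs, r) base instruments in long form, year: (n_obs,) period labels.
    Returns an (n_obs, n_years * r) array; block s equals z when year == s, else 0.
    """
    years = np.unique(year)
    dummies = (year[:, None] == years[None, :]).astype(float)   # (n_obs, n_years)
    stacked = dummies[:, :, None] * z[:, None, :]               # (n_obs, n_years, r)
    return stacked.reshape(z.shape[0], -1)

# Illustrative layout: 3 individuals observed in years 3, ..., 10 with 9 base instruments.
rng = np.random.default_rng(3)
year = np.tile(np.arange(3, 11), 3)
z = rng.normal(size=(year.size, 9))
Z_stacked = stack_instruments(z, year)        # 8 x 9 = 72 instrument columns
```

The resulting columns can be passed as the instrument list to any cross-section 2SLS routine, with standard errors clustered on the individual.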


22.4. Random and Fixed Effects Panel GMM 


We now augment the panel data model (22.1) by including a time-invariant additive
individual-specific effect $\alpha_i$, so

$$y_{it} = \alpha_i + x_{it}'\beta + \varepsilon_{it}. \tag{22.23}$$

Then the error term in (22.1) is now modeled as $u_{it} = \alpha_i + \varepsilon_{it}$. For simplicity the same
notation is used for both fixed and random effects models, so in the case of the random
effects model the common intercept $\mu$ in Section 21.7 is subsumed into $x_{it}'\beta$.

Some components of the regressors $x_{it}$ are assumed to be endogenous, with
$E[x_{it}(\alpha_i + \varepsilon_{it})] \neq 0$, so that the OLS estimator of $\beta$ is inconsistent. In this section
we propose IV estimators that yield consistent estimates of $\beta$ in a variety of settings,
including fixed effects, random effects, a hybrid of the two, and systems of equations. 


22.4.1. Random Effects or Fixed Effects? 


Recall from Chapter 21 that the individual-specific effect $\alpha_i$ can be viewed as random
in both the FE and RE models. This random variable $\alpha_i$ was independent of $x_{it}$ in the
RE model but correlated with $x_{it}$ in the FE model. For the RE model all coefficients
are estimable, whereas in the FE model coefficients of time-invariant regressors are
not estimable as consistent estimation requires elimination of $\alpha_i$ and the time-invariant
regressors by differencing.

In this chapter with endogenous regressors we view a model to be a random effects
model if instruments $Z_i$ exist that satisfy $E[Z_i(\alpha_i + \varepsilon_{it})] = 0$. Then the methods of
Section 22.2 will permit consistent estimation of all regression parameters. If instead
it is possible only to find instruments such that $E[Z_i\varepsilon_{it}] = 0$, but $E[Z_i\alpha_i] \neq 0$, we view
the model to be a fixed effects model. Then $\alpha_i$ must be eliminated by differencing, in
which case only the coefficients of time-varying regressors will be identified.


22.4.2. IV for Fixed Effects Models 


The various differencing operations given in Section 21.2 applied to (22.23) lead to a 
transformed model of the form 


$$\tilde{y}_{it} = \tilde{x}_{it}'\beta + \tilde{\varepsilon}_{it},$$


where the tilde denotes a differencing transformation that eliminates $\alpha_i$, and leading
examples are given in the following. Upon stacking we get


$$\tilde{y}_i = \tilde{X}_i\beta + \tilde{\varepsilon}_i. \tag{22.24}$$


If $E[x_{it}\varepsilon_{it}] \neq 0$ then $E[\tilde{x}_{it}\tilde{\varepsilon}_{it}] \neq 0$ and LS estimation of (22.24) leads to inconsistent
estimates.

We now consider IV estimation, assuming existence of instruments $Z_i$ that satisfy
$E[Z_i'\tilde{\varepsilon}_i] = 0$. Then panel GMM estimation (IV, 2SLS, or 2SGMM) of (22.24) with
instruments $Z_i$ yields consistent estimates of the coefficients of time-varying regressors.
Panel-robust standard errors can be computed as discussed in Section 22.2.2. 

One way that instruments may be obtained is through logic similar to that in the 
cross-section case. A valid instrument is a variable correlated with the regressor but 
not the error, yet is also one that can be excluded from the right-hand side of (22.23). 
Another way to obtain instruments, emphasized here, is through use of exogenous 
regressors in periods other than the current period, using the exogeneity assumptions 
detailed in Section 22.2.4. 

The primitive assumptions for instrument availability are those on correlation between
$z_{is}$ and $\varepsilon_{it}$. However, here it is correlation between $z_{is}$ and the differenced error
$\tilde{\varepsilon}_{it}$ that matters. In general, differencing, necessary to eliminate the fixed effect,
reduces the number of available instruments. Some differencing operations lead to 
greater loss than others and can even lead to inconsistent IV estimation. We consider 
three differencing operations with focus on weakly exogenous instruments. This 
can be a more realistic assumption in practice, especially for application to dynamic 
models. 


IV for the First-Differences Model 


The first-differences IV estimator is the IV or 2SLS or panel GMM estimator of the 
first-differences model 


$$y_{it} - y_{i,t-1} = (x_{it} - x_{i,t-1})'\beta + (\varepsilon_{it} - \varepsilon_{i,t-1}), \quad t = 2, \ldots, T. \tag{22.25}$$


The weak exogeneity assumption that $E[z_{is}\varepsilon_{it}] = 0$ for $s \leq t$ implies
$E[z_{is}(\varepsilon_{it} - \varepsilon_{i,t-1})] = 0$ for $s \leq t - 1$. First differencing therefore shortens the time series on the
available instrument set by one period, so that only $z_{i,t-1}, z_{i,t-2}, \ldots$ are available as
instruments. Assuming weak exogeneity, these yield a consistent IV estimator of $\beta$.

The use of lagged regressors as instruments was first proposed by Anderson and 
Hsiao (1981) in the context of dynamic panel models and was expanded upon by Holtz- 
Eakin, Newey, and Rosen (1988) and Arellano and Bond (1991) (see Section 22.5.3). 
Section 22.3 provided a detailed empirical example of this approach. 


Note that one can instead use transformed instruments $\tilde{z}_{is} = \Delta z_{is} = z_{is} - z_{i,s-1}$,
$s \leq t - 1$. However, there is no gain, since using $\Delta z_{i,t-1}, \ldots, \Delta z_{i2}, z_{i1}$ is equivalent
to using $z_{i,t-1}, \ldots, z_{i2}, z_{i1}$ as instruments, and only $z_{i1}$ and not $\Delta z_{i1}$ can be computed
if data begin in period 1.
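
The equivalence holds because the differenced instruments are an invertible linear combination of the level instruments, so both sets span the same column space and give identical first-stage fitted values. A small numerical check, using arbitrary simulated instruments for a single period t (an assumption made only for illustration):

```python
import numpy as np

def projection(Z):
    """Projection matrix onto the column space of Z."""
    return Z @ np.linalg.solve(Z.T @ Z, Z.T)

rng = np.random.default_rng(4)
n, m = 50, 4
Z_levels = rng.normal(size=(n, m))                        # columns: z_i1, ..., z_i,t-1
Z_diffs = np.column_stack([Z_levels[:, :1],               # z_i1 kept in levels
                           np.diff(Z_levels, axis=1)])    # Delta z_i2, ..., Delta z_i,t-1
same_fit = np.allclose(projection(Z_levels), projection(Z_diffs))
```

Here `same_fit` is true, so 2SLS with either instrument set yields the same estimate.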


IV for the Within or Mean-Differenced Model 


The within IV estimator is the IV or 2SLS or panel GMM estimator of the within 
model or mean-differenced model 


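$$y_{it} - \bar{y}_i = (x_{it} - \bar{x}_i)'\beta + (\varepsilon_{it} - \bar{\varepsilon}_i). \tag{22.26}$$


Then $E[z_{is}\varepsilon_{it}] = 0$ for $s \leq t$ no longer implies $E[z_{is}(\varepsilon_{it} - \bar{\varepsilon}_i)] = 0$ even for s much
less than t. To see this suppose that $E[z_{is}\varepsilon_{it}] \neq 0$ for s > t. Then $E[z_{is}\bar{\varepsilon}_i] \neq 0$ for all s > 1
since $\bar{\varepsilon}_i = T^{-1}\sum_{t=1}^{T}\varepsilon_{it}$ includes past $\varepsilon_{it}$, which are correlated with $z_{is}$.

Thus IV estimation of the within model leads to inconsistent estimation of $\beta$ if the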
instruments are weakly exogenous or if they satisfy the even weaker assumptions of 
contemporaneous exogeneity or the summation condition. The within transformation 
can only be used if the instruments are actually strongly exogenous. 


IV for the Forward Orthogonal Deviations Model


An alternative method to first differences, one that also requires that instruments be 
only weakly exogenous rather than strongly exogenous, was proposed by Arellano 
and Bover (1995). We also present this method, even though first differences are used 
much more. 

For the stacked model (22.2) for the $i$th observation, the first-difference transformation
yields the model $Dy_i = DX_i\beta + D\varepsilon_i$, where $D$ is a $(T - 1) \times T$ matrix with entry
$D_{ts}$, $t = 1, \ldots, T - 1$, $s = 1, \ldots, T$, equal to minus one if $s = t$, equal to one if
$s = t + 1$, and equal to zero otherwise. If $\varepsilon_{it}$ are iid the transformed error is MA(1)
and $V[Du_i] = \sigma^2 DD'$. The GLS estimator then premultiplies $D\varepsilon_i$ by $(DD')^{-1/2}$, or
premultiplies $\varepsilon_i$ by $(DD')^{-1/2}D$. This yields a transformed model of the form (22.24),
where the tilde denotes premultiplication by $(DD')^{-1/2}D$.

If the upper triangular Cholesky factorization is used to obtain $(DD')^{-1/2}$, then this
yields the forward orthogonal deviation model

$$c_t(y_{it} - \bar{y}^F_{it}) = c_t(x_{it} - \bar{x}^F_{it})'\beta + c_t(\varepsilon_{it} - \bar{\varepsilon}^F_{it}) \tag{22.27}$$

(see Arellano, 2003, p. 17), where $c_t^2 = (T - t)/(T - t + 1)$ and the superscript $F$
denotes that only future values are used to form the average. For example,
$\bar{y}^F_{it} = (T - t)^{-1}\sum_{s=t+1}^{T} y_{is}$.

The transformation is called orthogonal deviations because the transformed errors
$c_t(\varepsilon_{it} - \bar{\varepsilon}^F_{it})$ have unit variance and are uncorrelated. The adjective forward is added
as the transformed error depends only on current and future values of the original
error. OLS estimation of (22.27) yields the within estimator of Chapter 21, so the
orthogonal deviations transformation is optimal if indeed $\varepsilon_{it}$ are iid.
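
The properties claimed for the forward orthogonal deviations transformation can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `forward_orthogonal_deviations` and the choice $T = 5$ are illustrative and not taken from the text. It builds the $(T - 1) \times T$ matrix whose $t$th row computes $c_t$ times the deviation from the forward mean, then checks that its rows are orthonormal, that it annihilates a constant, and that it reproduces the within (mean-differencing) projection.

```python
import numpy as np

def forward_orthogonal_deviations(T):
    """Return the (T-1) x T forward orthogonal deviations matrix A.

    Row t (t = 1, ..., T-1) maps a T-vector e to
    c_t * (e_t - mean(e_{t+1}, ..., e_T)), with c_t^2 = (T-t)/(T-t+1).
    """
    A = np.zeros((T - 1, T))
    for t in range(1, T):                  # t is the 1-based period index
        c_t = np.sqrt((T - t) / (T - t + 1))
        A[t - 1, t - 1] = c_t              # weight on the current value
        A[t - 1, t:] = -c_t / (T - t)      # minus c_t times the forward mean
    return A

T = 5
A = forward_orthogonal_deviations(T)

# Orthonormal rows: iid errors remain uncorrelated with a common variance.
rows_orthonormal = np.allclose(A @ A.T, np.eye(T - 1))
# Rows sum to zero: the individual effect alpha_i is eliminated.
removes_constant = np.allclose(A @ np.ones(T), 0.0)
# A'A equals the mean-differencing projection, so OLS on the transformed
# data coincides with the within estimator.
within_projection = np.allclose(A.T @ A, np.eye(T) - np.ones((T, T)) / T)

print(rows_orthonormal, removes_constant, within_projection)
```

The third check is the algebraic reason OLS on (22.27) coincides with the within estimator: both use the same quadratic form in the data.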

The forward orthogonal deviations IV estimator is the IV or 2SLS or panel
GMM estimator of the model (22.27). For weakly exogenous instruments, $E[z_{is}\varepsilon_{it}] = 0$
for $s \le t$ implies $E[z_{is}(\varepsilon_{it} - \bar{\varepsilon}^F_{it})] = 0$ for $s \le t$. Forward orthogonal deviations
therefore lead to no loss in the number of available instruments. The transformation is
usually not applied to the instruments, as $(z_{it} - \bar{z}^F_{it})$ involves future values of $z_{it}$ that in
many applications are correlated with $\varepsilon_{it}$.


22.4.3. IV for Random Effects Models 


The model stacked for the $i$th observation is

$$y_i = X_i\beta + e\alpha_i + \varepsilon_i,$$

where $e$ is a $T \times 1$ vector of ones. Consistent but inefficient estimates can be obtained
by directly applying the panel GMM estimators of Section 22.2 given instruments $Z_i$,
obtained through exclusion restrictions or through appropriate exogeneity restrictions,
such that $E[Z_i'(e\alpha_i + \varepsilon_i)] = 0$. Here we go further and consider more efficient
estimation that, as in Chapter 21, controls for error correlation over time given the error
components model $u_{it} = \alpha_i + \varepsilon_{it}$.


IV Estimation of Transformed Model


Assume that the instruments $Z_i$ satisfy $E[u_i|Z_i] = 0$ and $V[u_i|Z_i] = \Omega_i$, where $\Omega_i$
has the same form as in the standard RE model, with diagonal entries $\sigma_\alpha^2 + \sigma_\varepsilon^2$ and
off-diagonal entries $\sigma_\alpha^2$. Note that this is a stronger assumption than $E[Z_i'u_i] = 0$ and will
therefore place restrictions on available instruments.

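Given the conditional moment condition $E[u_i|Z_i] = 0$, from Section 6.3.7 the optimal
unconditional moment condition is

$$E[Z_i'\Omega_i^{-1}u_i] = E[(\Omega_i^{-1/2}Z_i)'(\Omega_i^{-1/2}u_i)] = 0.$$

This leads to GMM estimation in the transformed system $y_i^* = X_i^*\beta + u_i^*$ with
transformed instruments $Z_i^*$, where the asterisk denotes premultiplication by the $T \times T$
matrix $\Omega_i^{-1/2}$ or a consistent estimate $\hat{\Omega}_i^{-1/2}$.

From Section 21.7.1 premultiplication by $\hat{\Omega}_i^{-1/2}$ leads to the model

$$y_{it} - \hat{\lambda}\bar{y}_i = (x_{it} - \hat{\lambda}\bar{x}_i)'\beta + \{(1 - \hat{\lambda})\alpha_i + (\varepsilon_{it} - \hat{\lambda}\bar{\varepsilon}_i)\}, \tag{22.28}$$

where $\hat{\lambda}$ is a consistent estimate of $\lambda = 1 - \sigma_\varepsilon/\sqrt{\sigma_\varepsilon^2 + T\sigma_\alpha^2}$. The random effects IV
estimator is the IV or 2SLS estimator of this model with transformed instruments
$\tilde{z}_{it} = (z_{it} - \hat{\lambda}\bar{z}_i)$, or equivalently with instruments $z_{it} - \bar{z}_i$ and $\bar{z}_i$.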

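This method requires a consistent estimate $\hat{\lambda}$ of $\lambda$. For $\sigma_\varepsilon^2$ we use
$\hat{\sigma}_\varepsilon^2 = \sum_i\sum_t \hat{\varepsilon}_{it}^2/[N(T - 1)]$, where $\hat{\varepsilon}_{it}$ is the residual from within IV regression of $(y_{it} - \bar{y}_i)$ on
$(x_{it} - \bar{x}_i)$ with instruments $(z_{it} - \bar{z}_i)$ (see (22.26)). Also, $(\sigma_\varepsilon^2 + T\sigma_\alpha^2)$ can be estimated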
by $T\sum_i \hat{\bar{u}}_i^2/N$, where $\hat{\bar{u}}_i$ is the residual from the between IV regression of $\bar{y}_i$ on $\bar{x}_i$ with
instruments $\bar{z}_i$. The resulting IV estimator of $\beta$ is called the error components 2SLS
(EC2SLS) estimator by Baltagi (1981).
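
As a small illustration of the transformation in (22.28), the sketch below computes $\lambda$ from given variance components and applies the quasi-demeaning $v_{it} - \lambda\bar{v}_i$ to a balanced panel stored as an $(N, T)$ array. It assumes NumPy; the helper names `re_lambda` and `quasi_demean` and the array layout are hypothetical choices for this example, not part of the original treatment. In practice the variance components would be replaced by the estimates described above.

```python
import numpy as np

def re_lambda(sigma2_eps, sigma2_alpha, T):
    """lambda = 1 - sigma_eps / sqrt(sigma_eps^2 + T * sigma_alpha^2)."""
    return 1.0 - np.sqrt(sigma2_eps / (sigma2_eps + T * sigma2_alpha))

def quasi_demean(v, lam):
    """Return v_it - lam * vbar_i for an (N, T) array of one variable."""
    return v - lam * v.mean(axis=1, keepdims=True)

# lambda = 0 gives pooled estimation; lambda = 1 gives the within transformation.
lam_no_effect = re_lambda(1.0, 0.0, 5)
lam_example = re_lambda(1.0, 1.0, 3)      # 1 - 1/sqrt(4)

v = np.array([[1.0, 2.0, 3.0],
              [4.0, 6.0, 8.0]])
v_within = quasi_demean(v, 1.0)           # full mean differencing
print(lam_no_effect, lam_example)
print(v_within)
```

The same `quasi_demean` call would be applied to the dependent variable, each regressor, and each instrument before running 2SLS on the transformed data.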

These results are dependent on specification of a particular functional form for $\Omega_i$.
The results in Section 22.2.2 permit inference that is robust to misspecification of
$\Omega_i$, using (22.5) where $y$, $X$, $Z$, and $W_N = [Z'Z]^{-1}$ are replaced by the transformed
variables in (22.28).

A more important restriction is that this method can only be used if the original
instruments are strongly exogenous. Here consistency requires that $E[Z_i'\Omega_i^{-1}u_i] = 0$,
a much stronger assumption than $E[Z_i'u_i] = 0$, which essentially requires that
$E[u_i|Z_i] = 0$. For example, suppose $E[z_{it}\alpha_i] = 0$ for all $t$ whereas $E[z_{is}\varepsilon_{it}] = 0$ for
$s \le t$ but $E[z_{is}\varepsilon_{it}] \ne 0$ for $s > t$. Then $E[z_{is}\bar{\varepsilon}_i] \ne 0$, leading to correlation of
instruments with the error term in (22.28).


22.4.4. IV for the Hausman-Taylor Hybrid Model


A leading example of endogeneity involves regressors correlated with the individual-specific
effect $\alpha_i$. This leads to inconsistency of the RE estimator of Chapter 21. An
obvious solution is to instead use the within (or fixed effects) estimator, which is
consistent. However, then the coefficients of time-invariant individual regressors cannot be
identified. This defeats the purpose of many panel studies, which is estimation of the effect of
time-invariant regressors, such as the effect of the level of schooling in a postschooling
earnings regression.

Hausman and Taylor (1981) considered the following variant of (22.23):

$$y_{it} = x_{1it}'\beta_1 + x_{2it}'\beta_2 + w_{1i}'\gamma_1 + w_{2i}'\gamma_2 + \alpha_i + \varepsilon_{it}, \tag{22.29}$$

where some regressors are assumed to be correlated with $\alpha_i$ whereas others are not,
and $w$ is introduced to denote time-invariant regressors. Specifically, $x_{1it}$ and $w_{1i}$ are
uncorrelated with $\alpha_i$ but $x_{2it}$ and $w_{2i}$ are correlated with $\alpha_i$. All regressors are assumed
to be uncorrelated with $\varepsilon_{it}$. In this model the $\alpha_i$ can be viewed as a hybrid of random
and fixed effects.

Hausman and Taylor (1981) proposed making use of the time-varying exogenous
regressor $x_{1it}$ in two ways: to estimate $\beta$, and as an instrument for $w_{2i}$, permitting
estimation of $\gamma$. Then $\gamma$ is identified if the number of time-varying exogenous
regressors equals or exceeds the number of time-invariant endogenous regressors.
Amemiya and MaCurdy (1986) proposed a more efficient estimator that uses $x_{1it}$ in
$(T + 1)$ ways: to estimate $\beta$, and as $T$ instruments for $w_{2i}$, permitting identification
if $\dim[w_{2i}] \le T\dim[x_{1it}]$. This approach to obtaining instruments from exogenous
regressors in periods other than the current period has already been discussed in detail
in Section 22.2.4.

Various projections, some equivalent, can be used to generate suitable instruments. 
Breusch, Mizon, and Schmidt (1989) provided a simpler presentation and projection 
that permits estimation using a 2SLS package. 

First consider consistent but inefficient estimation that ignores the panel correlation
structure of $(\alpha_i + \varepsilon_{it})$. The within transformation eliminates correlation with $\alpha_i$, so
$\ddot{x}_{2it} = x_{2it} - \bar{x}_{2i}$ can be used as an instrument for endogenous $x_{2it}$. The instrument for $x_{1it}$
is similarly $\ddot{x}_{1it}$, rather than the more obvious $x_{1it}$. Then $\bar{x}_{1i}$ is used as an instrument
for endogenous $w_{2i}$, whereas the exogenous $w_{1i}$ is used as an instrument for itself.

Now consider efficient estimation under the random effects assumption that the
components $\alpha_i$ and $\varepsilon_{it}$ are homoskedastic. Then from (22.27) the random effects
differencing transformation (see (22.28)) leads to

$$\tilde{y}_{it} = \tilde{x}_{1it}'\beta_1 + \tilde{x}_{2it}'\beta_2 + \tilde{w}_{1i}'\gamma_1 + \tilde{w}_{2i}'\gamma_2 + \tilde{v}_{it}, \tag{22.30}$$

where, for example, $\tilde{x}_{1it} = x_{1it} - \hat{\lambda}\bar{x}_{1i}$, and an estimator of the scalar $\lambda$ was
presented at the end of the preceding section. The Hausman-Taylor estimator is
equivalent to IV estimation of (22.30) using as instruments $\ddot{x}_{1it}$, $\ddot{x}_{2it}$, $w_{1i}$, and $\bar{x}_{1i}$. The
exogenous time-varying regressors $x_{1it} = \ddot{x}_{1it} + \bar{x}_{1i}$ are used as instruments twice, with
the within difference $\ddot{x}_{1it}$ used as an instrument for $x_{1it}$ and the time average $\bar{x}_{1i}$ used
as an instrument for $w_{2i}$. The estimator of Amemiya and MaCurdy (1986) instead uses
as instruments $\ddot{x}_{1it}$, $\ddot{x}_{2it}$, $w_{1i}$, and $x_{1i1}, \ldots, x_{1iT}$, so that the entire history of $x_{1it}$ rather
than just the time average is used as an instrument. This requires that $E[x_{1it}\alpha_i] = 0$ for
$t = 1, \ldots, T$, a stronger assumption than $E[\bar{x}_{1i}\alpha_i] = 0$ (see Section 22.2.4). Breusch
et al. (1989) proposed an even more efficient estimator using $\ddot{x}_{2is}$, $s \ne t$, as additional
instruments.

The major limitation of this approach is that it requires specification of which
regressors are either correlated or not correlated with $\alpha_i$. In a postschooling log-wage
regression, Hausman and Taylor begin by assuming that all three time-varying
regressors (experience, bad health, and unemployment last year) are exogenous, two
time-invariant regressors (race and union status) are exogenous, and the time-invariant
regressor of interest (schooling) is endogenous. In this specification there are two
overidentifying restrictions. A model specification test is possible using a Hausman
test based on the difference between $\hat{\beta}_{HT}$ and $\hat{\beta}_W$, since the within estimator for $\beta$
is consistent regardless of which components of $x_{it}$ and $w_i$ are correlated with $\alpha_i$.
Cornwell and Rupert (1988) provide an empirical study that contrasts the various
estimators.


22.4.5. SUR and Simultaneous Equations Estimation 


The preceding panel data analysis has focused exclusively on estimation of a single 
equation in isolation. In some cases it may be desired to estimate a system of equations, 
such as a system of demand equations, where dependent variables and regressors are 
observed for many individuals at several points in time. If there are no cross-equation 
restrictions on the parameters then single-equation estimation can yield consistent es- 
timates, but more efficient estimation is possible using joint equation estimation that 
exploits error correlation across equations. 

In the Chapter 21 framework of strongly exogenous regressors, the more efficient
estimator is an extension of seemingly unrelated regressions from cross-section to
panel data. The error components SUR model specifies the $g$th of $G$ equations to
be given by

$$y_{git} = x_{git}'\beta_g + \alpha_{gi} + \varepsilon_{git}, \quad g = 1, \ldots, G, \tag{22.31}$$

where, as in the cross-section case, $\alpha_{gi}$ is independent over $i$, $\varepsilon_{git}$ is independent over
$i$ and $t$, and $\alpha_{gi}$ and $\varepsilon_{git}$ are independent of each other. However, the error
components are allowed to be correlated across equations, so that $\mathrm{Cov}[\alpha_{gi}, \alpha_{hi}] \ne 0$ and
$\mathrm{Cov}[\varepsilon_{git}, \varepsilon_{hit}] \ne 0$ for $g \ne h$. Then the Chapter 21 methods yield consistent estimates.
The obvious single-equation estimator is the random effects estimator that is feasible
GLS controlling for the correlation within each equation. More efficient GLS
estimators that additionally control for cross-equation correlation in the errors are detailed in
Avery (1977) and Baltagi (1980).

Similar efficiency gains can be found when the system is one of simultaneous
equations, where now in (22.31) the regressor $x_{git}$ may include one or more endogenous
regressors $y_{hit}$ from other equations. Then IV or GMM estimation of each single
equation yields consistent estimates, with the obvious estimator given the error
components structure being the random effects IV or EC2SLS estimator of Section 22.4.3.
More efficient estimates are obtained by systems estimation, using the error
components three-stage least-squares (EC3SLS) estimator proposed by Baltagi (1981).

The systems estimators are more difficult to implement and separate estimation of 
each equation may be adequate. Even if this simpler approach is taken, however, much 
can be gained in specifying a system of simultaneous equations as it permits identi- 
fication of the coefficients of endogenous regressors using as instruments exogenous 
regressors excluded from the equation of interest. This provides a more traditional ap- 
proach to obtaining instruments than using as instruments exogenous regressors from 
time periods other than the current one.


22.5. Dynamic Models


In this section we consider the usual individual-specific effects panel data model, with
the complication that the regressors include the dependent variable lagged once. Then
the model is a dynamic model with

$$y_{it} = \gamma y_{i,t-1} + x_{it}'\beta + \alpha_i + \varepsilon_{it}, \quad i = 1, \ldots, N, \quad t = 1, \ldots, T. \tag{22.32}$$

As usual the panel is short with data independent over $i$. It is assumed that $|\gamma| < 1$, an
assumption relaxed in Section 22.5.5.

An important result is that even if $\alpha_i$ is a random effect, OLS estimation of (22.32)
leads to inconsistent estimation of $\gamma$ and $\beta$. This is because the regressor $y_{i,t-1}$ is
correlated with $\alpha_i$ and hence with the composite error term $(\alpha_i + \varepsilon_{it})$. Alternative
estimators are needed even with random effects.

We consider estimation when $\alpha_i$ is a fixed effect, $|\gamma| < 1$, the error $\varepsilon_{it}$ is serially
uncorrelated, and the panel is short (see Section 22.5.3). Although this is the base
case for microeconometrics applications, there exists a vast literature that changes one
or more of these assumptions. More generally the individual-specific effect may be
purely random, errors may be serially correlated, data may be nonstationary, and the
panel may be a long panel, but we barely touch on this literature.


22.5.1. True State Dependence and Unobserved Heterogeneity 


Before considering estimation, we note that time-series correlation in $y_{it}$ is now
induced directly by $y_{i,t-1}$ in addition to the indirect effect via $\alpha_i$ already considered
in Chapter 21. These two causes lead to quite different interpretations of correlation
over time in, for example, individual earnings or welfare recipiency.

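For simplicity let $\beta = 0$ so that $y_{it} = \gamma y_{i,t-1} + \alpha_i + \varepsilon_{it}$. Then
$E[y_{it}|y_{i,t-1}, \alpha_i] = \gamma y_{i,t-1} + \alpha_i$ and $\mathrm{Cor}[y_{it}, y_{i,t-1}|\alpha_i] = \gamma$. Conditional on $\alpha_i$, the standard time-series
results for an AR(1) model apply with dependence over time in $y_{it}$ determined solely
by the autoregressive parameter $\gamma$. However, $\alpha_i$ is unknown and we actually observe
$E[y_{it}|y_{i,t-1}] = \gamma y_{i,t-1} + E[\alpha_i|y_{i,t-1}]$ and $\mathrm{Cor}[y_{it}, y_{i,t-1}] \ne \gamma$. Specifically, from
(22.32) with $\beta = 0$

$$
\begin{aligned}
\mathrm{Cor}[y_{it}, y_{i,t-1}] &= \mathrm{Cor}[\gamma y_{i,t-1} + \alpha_i + \varepsilon_{it},\, y_{i,t-1}] \\
&= \gamma + \mathrm{Cor}[\alpha_i, y_{i,t-1}] \\
&= \gamma + \frac{1 - \gamma}{1 + (1 - \gamma)\sigma_\varepsilon^2/[(1 + \gamma)\sigma_\alpha^2]},
\end{aligned} \tag{22.33}
$$

where the second equality assumes $\mathrm{Cor}[\varepsilon_{it}, y_{i,t-1}] = 0$ and the third equality is
obtained after some algebra for the special case of random effects with $\varepsilon_{it}$ iid $[0, \sigma_\varepsilon^2]$ and
$\alpha_i$ iid $[0, \sigma_\alpha^2]$.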
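
The closed form in (22.33) can be cross-checked against the stationary second moments of the process. The sketch below is only a numerical sanity check under the assumption that $y_{it}$ is stationary, so that $V[y_{it}] = \sigma_\alpha^2/(1 - \gamma)^2 + \sigma_\varepsilon^2/(1 - \gamma^2)$ and $\mathrm{Cov}[y_{it}, y_{i,t-1}] = \sigma_\alpha^2/(1 - \gamma)^2 + \gamma\sigma_\varepsilon^2/(1 - \gamma^2)$; the function names are illustrative.

```python
def correlation_closed_form(gamma, sigma2_alpha, sigma2_eps):
    """Right-hand side of (22.33); requires sigma2_alpha > 0 and |gamma| < 1."""
    ratio = (1 - gamma) * sigma2_eps / ((1 + gamma) * sigma2_alpha)
    return gamma + (1 - gamma) / (1 + ratio)

def correlation_from_moments(gamma, sigma2_alpha, sigma2_eps):
    """Cov[y_t, y_{t-1}] / V[y_t] from the stationary representation
    y_it = alpha_i / (1 - gamma) + sum_j gamma^j * eps_{i,t-j}."""
    var_y = sigma2_alpha / (1 - gamma) ** 2 + sigma2_eps / (1 - gamma ** 2)
    cov_y = sigma2_alpha / (1 - gamma) ** 2 + gamma * sigma2_eps / (1 - gamma ** 2)
    return cov_y / var_y

# Mixed case: both state dependence and heterogeneity contribute.
both = correlation_closed_form(0.5, 1.0, 1.0)
# Pure unobserved heterogeneity: gamma = 0 gives sigma2_alpha / (sigma2_alpha + sigma2_eps).
heterogeneity_only = correlation_closed_form(0.0, 1.0, 1.0)
print(both, correlation_from_moments(0.5, 1.0, 1.0), heterogeneity_only)
```

With $\gamma = 0.5$ and equal variance components the correlation is 0.875, well above $\gamma$, which illustrates how unobserved heterogeneity inflates the observed persistence.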

Result (22.33) makes it clear that there are two possible reasons for correlation
between $y_{it}$ and $y_{i,t-1}$.

True state dependence occurs when correlation over time is due to the causal
mechanism that $y_{i,t-1}$ last period determines $y_{it}$ this period. This dependence is
relatively large if the individual effect $\alpha_i \simeq 0$, as then $\mathrm{Cor}[y_{it}, y_{i,t-1}] \simeq \gamma$. More
generally, this happens when $\sigma_\alpha^2$ is very small relative to $\sigma_\varepsilon^2$.

Correlation due to unobserved heterogeneity arises even if there is no causal
mechanism, so $\gamma = 0$, but nonetheless there is correlation since $\mathrm{Cor}[y_{it}, y_{i,t-1}]$ simplifies to
$\sigma_\alpha^2/(\sigma_\alpha^2 + \sigma_\varepsilon^2)$ if $\gamma = 0$, as in Chapter 21.

Both extremes permit this correlation to be arbitrarily close to one because either
$\gamma \to 1$ or $\sigma_\varepsilon^2/\sigma_\alpha^2 \to 0$. However, these give two quite different explanations with
quite different policy implications. A true state dependence explanation for earnings
$y_{it}$ being continuously high over time even after controlling for regressors $x_{it}$ is that
future earnings are determined by past earnings and $\gamma$ is large. An unobserved
heterogeneity explanation is that actually $\gamma$ is small, but important variables have been
omitted from $x_{it}$, leading to a high $\alpha_i$ in each time period. For duration data the
distinction between true state dependence and unobserved heterogeneity was explored in
Chapter 18. The static linear panel models of Chapter 21 considered only unobserved
heterogeneity.


22.5.2. Inconsistency of Standard Panel Estimators 


The estimators from the previous chapter are all inconsistent if the regressors include
lagged dependent variables, even in the case of the random effects model. We consider
estimation of the model given in (22.32), where the literature usually assumes that $\varepsilon_{it}$
are serially uncorrelated.

First consider OLS estimation of $y_{it}$ on $y_{i,t-1}$ and $x_{it}$. The error term is then
$(\alpha_i + \varepsilon_{it})$, which is correlated with the regressor $y_{i,t-1}$ since lagging the equation gives
$y_{i,t-1} = \gamma y_{i,t-2} + x_{i,t-1}'\beta + \alpha_i + \varepsilon_{i,t-1}$, so that $y_{i,t-1}$ is correlated with $\alpha_i$. Note that
this is a departure from earlier results for OLS estimation of the random effects model
without lagged dependent variable, as then OLS of $y_{it}$ on $x_{it}$ yields a consistent, albeit
inefficient, estimator. This is also a departure from the usual OLS result that regression
of $y_{it}$ on $y_{i,t-1}$ yields a consistent estimate (though one biased in small samples) if the
error is serially uncorrelated.

Second, consider the within estimator, which regresses $(y_{it} - \bar{y}_i)$ on $(y_{i,t-1} - \bar{y}_{i,-1})$
and $(x_{it} - \bar{x}_i)$. This regression has error term $(\varepsilon_{it} - \bar{\varepsilon}_i)$. Now by (22.32), $y_{it}$ is
correlated with $\varepsilon_{it}$, so $y_{i,t-1}$ is correlated with $\varepsilon_{i,t-1}$ and hence with $\bar{\varepsilon}_i$. This implies that
the regressor $(y_{i,t-1} - \bar{y}_{i,-1})$ is correlated with the error $(\varepsilon_{it} - \bar{\varepsilon}_i)$. Thus OLS estimation
of the within model leads to inconsistent parameter estimates, because the regressor is
correlated with the error term. Consistency requires that $\bar{\varepsilon}_i$ becomes very small relative
to $\varepsilon_{it}$, which requires $T \to \infty$; this occurs in long panels but not in short panels. A
leading reference is Nickell (1981).
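
A small simulation makes the short-panel inconsistency of the within estimator concrete. This is a sketch under simplifying assumptions that are not part of the text: no regressors $x_{it}$, standard normal $\alpha_i$ and $\varepsilon_{it}$, $\gamma = 0.5$, and a burn-in of 50 periods so the process is close to stationary. The function names, sample sizes, and seed are arbitrary.

```python
import numpy as np

def simulate_ar1_panel(N, T, gamma, rng, burn_in=50):
    """Simulate y_it = gamma * y_i,t-1 + alpha_i + eps_it; returns (N, T+1)."""
    alpha = rng.normal(size=N)
    y = alpha / (1 - gamma)              # start at the individual-specific mean
    out = np.empty((N, T + 1))
    for t in range(burn_in + T + 1):
        y = gamma * y + alpha + rng.normal(size=N)
        if t >= burn_in:
            out[:, t - burn_in] = y      # columns hold y_i0, ..., y_iT
    return out

def within_estimate(y):
    """OLS of mean-differenced y_it on mean-differenced y_i,t-1."""
    y_cur, y_lag = y[:, 1:], y[:, :-1]
    y_cur = y_cur - y_cur.mean(axis=1, keepdims=True)
    y_lag = y_lag - y_lag.mean(axis=1, keepdims=True)
    return (y_lag * y_cur).sum() / (y_lag ** 2).sum()

rng = np.random.default_rng(0)
gamma_short = within_estimate(simulate_ar1_panel(2000, 5, 0.5, rng))
gamma_long = within_estimate(simulate_ar1_panel(2000, 100, 0.5, rng))
print(gamma_short, gamma_long)
```

With only five usable periods the estimate falls far below the true value of 0.5 no matter how large $N$ is, whereas with 100 periods it is close to 0.5, in line with the requirement that $T \to \infty$.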

Inconsistency also arises for the random effects estimator given in Chapter 21,
since this is a linear combination of the within and between estimators. For random
effects models Anderson and Hsiao (1981) instead considered ML estimation when
$\varepsilon_{it} \sim N[0, \sigma_\varepsilon^2]$; see also Bhargava and Sargan (1983). In short panels the distribution
of the MLE depends on the assumptions made on $y_{i0}$, the initial value of the dependent
variable. Anderson and Hsiao (1981) distinguish among the following initial condition
assumptions: (1) fixed initial observations, (2) random initial observations with a
common mean, (3) random initial observations with different means, and (4) random
initial observations with a stationary distribution.

The first differences OLS estimator is also inconsistent, but an IV variant leads to 
consistent estimates. We now present this estimator. 


22.5.3. Arellano-Bond Estimator

Model (22.32) leads to the first-differences model

$$y_{it} - y_{i,t-1} = \gamma(y_{i,t-1} - y_{i,t-2}) + (x_{it} - x_{i,t-1})'\beta + (\varepsilon_{it} - \varepsilon_{i,t-1}), \quad t = 2, \ldots, T. \tag{22.34}$$

The OLS estimator is inconsistent because $y_{i,t-1}$ is correlated with $\varepsilon_{i,t-1}$ from (22.32),
so the regressor $(y_{i,t-1} - y_{i,t-2})$ is correlated with the error $(\varepsilon_{it} - \varepsilon_{i,t-1})$ in (22.34).

Anderson and Hsiao (1981) proposed estimating (22.34) using the instrumental
variables estimator with $y_{i,t-2}$ as an instrument for $(y_{i,t-1} - y_{i,t-2})$. This is a valid
instrument, since $y_{i,t-2}$ is not correlated with $(\varepsilon_{it} - \varepsilon_{i,t-1})$ assuming the errors $\varepsilon_{it}$ are
serially uncorrelated. Furthermore, $y_{i,t-2}$ is a good instrument since it is correlated with
$(y_{i,t-1} - y_{i,t-2})$. The method requires availability of three periods of data for each
individual. An alternative is to use $\Delta y_{i,t-2}$ as an instrument for $\Delta y_{i,t-1}$, which will require
four periods of data. Anderson and Hsiao (1981) present results suggesting that the
IV estimator is more efficient using $\Delta y_{i,t-2}$ rather than $y_{i,t-2}$ as the instrument in the
usual case that $\gamma > 0$. In either case $(x_{it} - x_{i,t-1})$ is used as an instrument for itself.
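
The Anderson-Hsiao idea can be illustrated with a just-identified IV estimator on simulated data. The sketch below assumes a pure AR(1) panel with no regressors $x_{it}$, standard normal $\alpha_i$ and $\varepsilon_{it}$, and $\gamma = 0.5$; these settings, the seed, and the variable names are choices made for the example only. It compares OLS on the first-differenced model with IV using $y_{i,t-2}$ as the instrument for $\Delta y_{i,t-1}$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, gamma = 20000, 6, 0.5

# Simulate y_it = gamma * y_i,t-1 + alpha_i + eps_it, discarding a burn-in.
alpha = rng.normal(size=N)
y_now = alpha / (1 - gamma)
cols = []
for t in range(50 + T):
    y_now = gamma * y_now + alpha + rng.normal(size=N)
    if t >= 50:
        cols.append(y_now)
y = np.column_stack(cols)            # (N, T): y_i1, ..., y_iT

dy = np.diff(y, axis=1)              # columns: Delta y_i2, ..., Delta y_iT
dep = dy[:, 1:].ravel()              # Delta y_it,     t = 3, ..., T
reg = dy[:, :-1].ravel()             # Delta y_i,t-1,  t = 3, ..., T
inst = y[:, :-2].ravel()             # y_i,t-2,        t = 3, ..., T

gamma_fd_ols = (reg @ dep) / (reg @ reg)    # inconsistent first-differences OLS
gamma_ah = (inst @ dep) / (inst @ reg)      # just-identified IV estimate
print(gamma_fd_ols, gamma_ah)
```

In this design first-differences OLS is badly biased (its probability limit is $(\gamma - 1)/2$ under stationarity), while the IV estimate is close to 0.5. Standard errors are omitted; in practice they should allow for the MA(1) correlation in $\Delta\varepsilon_{it}$.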

More efficient estimation is possible by using additional lags of the dependent
variable as instruments. For example, both $y_{i,t-2}$ and $y_{i,t-3}$ might be used as
instruments. The model is then overidentified, so estimation should be by 2SLS or panel
GMM. Furthermore, the number of instruments available is highest for the dependent
variable observed at time $t$ closest to the final time period $T$. In period 3 only $y_{i1}$
is available as an instrument, in period 4 both $y_{i1}$ and $y_{i2}$ are available, in period 5
$y_{i1}$, $y_{i2}$, and $y_{i3}$ are available, and so on. Holtz-Eakin et al. (1988) and Arellano and
Bond (1991) proposed panel GMM estimators using these wider unbalanced instrument
sets.

The microeconometrics literature refers to the resulting panel GMM estimator as
the Arellano-Bond estimator. The general procedure has already been presented in
Section 22.4.2, where dynamics were not explicitly introduced. The estimator is

$$\hat{\beta}_{AB} = \left[\left(\sum_{i=1}^{N} \tilde{X}_i'Z_i\right) W_N \left(\sum_{i=1}^{N} Z_i'\tilde{X}_i\right)\right]^{-1} \left(\sum_{i=1}^{N} \tilde{X}_i'Z_i\right) W_N \left(\sum_{i=1}^{N} Z_i'\tilde{y}_i\right), \tag{22.35}$$

where $\tilde{X}_i$ is a $(T - 2) \times (K + 1)$ matrix with $t$th row $(\Delta y_{i,t-1}, \Delta x_{it}')$, $t = 3, \ldots, T$, $\tilde{y}_i$
is a $(T - 2) \times 1$ vector with $t$th row $\Delta y_{it}$, and $Z_i$ is a $(T - 2) \times r$ matrix of instruments

$$Z_i = \begin{bmatrix} z_{i3}' & 0 & \cdots & 0 \\ 0 & z_{i4}' & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & z_{iT}' \end{bmatrix}, \tag{22.36}$$

where often $z_{it}' = [y_{i,t-2}, y_{i,t-3}, \ldots, y_{i1}, \Delta x_{it}']$. Lags of $x_{it}$ or $\Delta x_{it}$ can additionally be
used as instruments, and for moderate or large $T$ there may be a maximum lag of $y_{it}$
that is used as an instrument, such as not more than $y_{i,t-4}$. Two-stage LS and two-step
GMM correspond to different weighting matrices $W_N$ (see Section 22.2.3).
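
The block-diagonal layout of the instrument matrix in (22.36) is easy to get wrong when coding it by hand. The following sketch builds $Z_i$ for one individual in the simplest case where only lagged levels of $y$ are used (no $\Delta x_{it}$ columns); the function name `arellano_bond_instruments` and the `max_lag` argument are hypothetical conveniences for this example, with `max_lag` meaning the deepest lag of $y$ allowed as an instrument.

```python
import numpy as np

def arellano_bond_instruments(y_i, max_lag=None):
    """Build the (T-2) x r instrument matrix for one individual.

    y_i holds (y_i1, ..., y_iT). The row for period t = 3, ..., T contains
    y_{i,t-2}, y_{i,t-3}, ..., y_{i1} in its own block and zeros elsewhere.
    If max_lag is given, lags deeper than y_{i,t-max_lag} are dropped.
    """
    y_i = np.asarray(y_i, dtype=float)
    T = len(y_i)
    blocks = []
    for t in range(3, T + 1):                  # 1-based period index
        lags = y_i[: t - 2][::-1]              # y_{i,t-2}, ..., y_{i1}
        if max_lag is not None:
            lags = lags[: max_lag - 1]         # keep lags 2, ..., max_lag
        blocks.append(lags)
    r = sum(len(b) for b in blocks)
    Z = np.zeros((T - 2, r))
    col = 0
    for row, b in enumerate(blocks):
        Z[row, col: col + len(b)] = b
        col += len(b)
    return Z

Z_full = arellano_bond_instruments([1.0, 2.0, 3.0, 4.0, 5.0])
Z_capped = arellano_bond_instruments([1.0, 2.0, 3.0, 4.0, 5.0], max_lag=3)
print(Z_full)
print(Z_capped.shape)
```

Without a lag cap the number of columns is $(T - 2)(T - 1)/2$, which grows quadratically in $T$; this is why a maximum lag is often imposed for moderate or large $T$.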

The method is easily adapted to an AR($p$) model, with $\gamma y_{i,t-1}$ in (22.32) replaced
by $\gamma_1 y_{i,t-1} + \gamma_2 y_{i,t-2} + \cdots + \gamma_p y_{i,t-p}$, though more than three periods of data will be
needed to permit consistent estimation.

The empirical example in Section 22.3 is essentially an Arellano-Bond estimation
example, since a first differences model is estimated by IV with lagged regressors used
as instruments.

Ahn and Schmidt (1995) noted that more efficient estimation is possible using
additional moment conditions. Consider the pure time-series version of (22.32) where
$\beta = 0$, and make the standard assumption that $\varepsilon_{it}$ is uncorrelated with $\alpha_i$, with $\varepsilon_{is}$ for
$s \ne t$, and with the initial observation $y_{i1}$. The Arellano-Bond estimator uses the moment
conditions $E[y_{is}\Delta u_{it}] = 0$ for $s \le t - 2$, where $u_{it} = \varepsilon_{it} + \alpha_i$. Ahn and Schmidt
(1995) obtain a more efficient estimator by additionally using the moment conditions
$E[u_{iT}\Delta u_{it}] = 0$. They show that this estimator, which makes efficient use of the second
moment assumptions, is asymptotically equivalent to the optimal minimum distance
estimator of Chamberlain (1982, 1984).

Additional assumptions lead to additional moment conditions and hence more efficient
estimation. If $\varepsilon_{it}$ is homoskedastic over time, so that $V[\varepsilon_{it}] = V[\varepsilon_{is}]$, then
$E[\bar{u}_i\Delta u_{it}] = 0$ (see Ahn and Schmidt, 1995). Arellano and Bover (1995) propose using the
condition $E[u_{it}\Delta y_{is}] = 0$ for $s \le t - 1$. Blundell and Bond (1998) consider these and
additional assumptions and show that the benefit can be considerable, especially when $\gamma$ is
high and $T$ is small. Arellano and Honoré (2001) present many assumptions that might
be made and the corresponding moment conditions that can be used in estimation.

Hsiao, Pesaran, and Tahmiscioglu (2002) propose a transformed ML estimator.
Assume that $\varepsilon_{it}$ are iid $N[0, \sigma_\varepsilon^2]$, an assumption that can be relaxed. Rather than form
the likelihood based on $\varepsilon_{i1}, \ldots, \varepsilon_{iT}$, they form the likelihood based on the error
differences $\Delta\varepsilon_{i1}, \ldots, \Delta\varepsilon_{iT}$. For the pure time series AR(1) model $\Delta\varepsilon_{it} = \Delta y_{it} - \gamma\Delta y_{i,t-1}$
for $t > 1$. The density of $\Delta\varepsilon_{i1}$ depends on the assumptions made about initial
conditions: either $\Delta\varepsilon_{i1} = \Delta y_{i1}$ or $\Delta\varepsilon_{i1} = \Delta y_{i1} - b$, where $b = E[\Delta y_{i1}]$ is an additional
parameter to be estimated. The resulting estimator is a quasi-MLE that retains
consistency even if $\varepsilon_{it}$ are nonnormal. If $\varepsilon_{it}$ are iid $[0, \sigma_\varepsilon^2]$ then the transformed MLE is more
efficient than the preceding GMM estimators.


22.5.4. Estimation of Covariance Structures 


Covariance structures are models that specify a structure for the covariance matrix of 
the regression error. Applications include structures for error dynamics and for mea- 
surement error. The goal is to estimate the parameters of the structure. 

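As an example, suppose that $y_{it}$ is generated by a random effects model with MA(1)
error, so that

$$y_{it} = \alpha_i + \varepsilon_{it} + \phi\varepsilon_{i,t-1},$$

where $\alpha_i \sim [0, \sigma_\alpha^2]$ and $\varepsilon_{it} \sim [0, \sigma_\varepsilon^2]$ and $|\phi| < 1$. Then the autocovariances
$\gamma_j = \mathrm{Cov}[y_{it}, y_{i,t-j}]$ satisfy $\gamma_0 = \sigma_\alpha^2 + (1 + \phi^2)\sigma_\varepsilon^2$, $\gamma_1 = \sigma_\alpha^2 + \phi\sigma_\varepsilon^2$, and $\gamma_j = \sigma_\alpha^2$ for $j \ge 2$.
If $T = 3$ these equations yield estimates $\hat{\sigma}_\alpha^2$, $\hat{\sigma}_\varepsilon^2$, and $\hat{\phi}$ given autocovariance
estimates $\hat{\gamma}_0$, $\hat{\gamma}_1$, and $\hat{\gamma}_2$. If $T > 3$ the model is overidentified as there are only three
variance parameters to estimate but more than three autocovariance estimates. An
obvious estimator is the minimum distance estimator.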
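
For the just-identified case $T = 3$, the three autocovariance equations can be inverted by hand. The sketch below shows one way to do so; the function name is illustrative, the MA parameter is written `phi`, and the invertible root $|\phi| < 1$ is selected, which requires $|(\gamma_1 - \gamma_2)/(\gamma_0 - \gamma_2)| \le 1/2$ (otherwise `math.sqrt` raises an error because no real solution exists).

```python
import math

def ma1_random_effects_from_autocov(g0, g1, g2):
    """Solve g0 = s2a + (1 + phi^2) s2e, g1 = s2a + phi s2e, g2 = s2a.

    Returns (sigma2_alpha, sigma2_eps, phi) using the root with |phi| < 1.
    """
    sigma2_alpha = g2
    a = g0 - g2                      # (1 + phi^2) * sigma2_eps
    b = g1 - g2                      # phi * sigma2_eps
    if b == 0:
        return sigma2_alpha, a, 0.0  # no MA component
    rho = b / a                      # equals phi / (1 + phi^2)
    phi = (1 - math.sqrt(1 - 4 * rho ** 2)) / (2 * rho)
    sigma2_eps = b / phi
    return sigma2_alpha, sigma2_eps, phi

# Autocovariances implied by sigma2_alpha = 1, sigma2_eps = 2, phi = 0.5.
s2a, s2e, phi = ma1_random_effects_from_autocov(3.5, 2.0, 1.0)
print(s2a, s2e, phi)
```

When $T > 3$ there are more autocovariances than parameters and this direct inversion no longer applies, which is the motivation for minimum distance estimation.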

In general let $\theta$ denote the $q$ structural parameters and suppose $g(\theta) = \gamma$, where
$\gamma = [\gamma_0, \ldots, \gamma_{T-1}]'$ is the vector of $T \ge q$ autocovariances. Then the minimum
distance estimator $\hat{\theta}_{MD}$ minimizes

$$Q_N(\theta) = (\hat{\gamma} - g(\theta))'W_N(\hat{\gamma} - g(\theta)), \tag{22.37}$$

where $\hat{\gamma} = [\hat{\gamma}_0, \ldots, \hat{\gamma}_{T-1}]'$,

$$\hat{\gamma}_j = (N(T - j))^{-1}\sum_{t=j+1}^{T}\sum_{i=1}^{N}(y_{it} - \bar{y}_t)(y_{i,t-j} - \bar{y}_{t-j}), \tag{22.38}$$

and $\bar{y}_{t-j} = N^{-1}\sum_i y_{i,t-j}$. The weighting matrix $W_N$ and further details on MD
estimation are provided in Section 6.7. The restrictions of the model can be tested by
use of the chi-squared test statistic given in Section 6.7. The discussion thus far has
already imposed the restriction of covariance stationarity. One can more generally
permit $\gamma_{tj} \ne \gamma_{sj}$ for $t \ne s$, where $\gamma_{tj} = \mathrm{Cov}[y_{it}, y_{i,t-j}]$. Then $\gamma$ has $T(T + 1)/2$ entries
$\gamma_{tj}$, $t = j + 1, \ldots, T$ and $j = 0, \ldots, T - 1$. The stationarity assumption is itself a
testable assumption. Moreover, regressors can be incorporated by replacing $y_{it}$ by the
residual $y_{it} - x_{it}'\hat{\beta}$.

Abowd and Card (1989) provided an early application of this approach to joint 
modeling of earnings and hours. Altonji and Segal (1996) demonstrated that the opti- 
mal MD estimator can be quite biased in finite samples (see Section 6.3.5). Many of 
the applications are to models of earnings; see Baker and Solon (2003) for a recent 
example. 

The MD approach is well suited to estimation of covariance structures. The panel 
data sets can be large, but by first estimating the autocovariances the estimation is 
reduced to minimizing (22.37). Other estimation approaches are possible. In particular, 
see MaCurdy (1982b), who presents Box-Jenkins type models for panel data.


22.5.5. Nonstationary Panels 


The panel literature on unit roots and nonstationarity emphasizes panels where both N 
and T are large. For unit root tests a key early paper is that by Levin and Lin (1992), 
ultimately published as Levin, Lin, and Chu (2002); Pesaran and Smith (1995) wrote 
an early paper that considered cointegration. Phillips and Moon (1999) and Pedroni 
(2004) provide general theory for inference with nonstationary panel data. Analysis is 
simplest using a sequential limit theory where, say, first $N$ is fixed and $T \to \infty$ and
subsequently $N \to \infty$. A more robust approach uses joint limits where $T \to \infty$
and $N \to \infty$ simultaneously. Recent reviews of the literature include those by Phillips
and Moon (2000) and Baltagi (2001, Chapter 12).

Less consideration has been given to nonstationary data in short panels. Harris and
Tzavalis (1999) consider the unit root tests of Levin and Lin (1992) in short panels.
Let $\hat{\gamma}$ denote the within estimate of $\gamma$ in the AR(1) fixed effects model
$y_{it} = \alpha_i + \gamma y_{i,t-1} + \varepsilon_{it}$, where $\varepsilon_{it} \sim N[0, \sigma^2]$. We consider the null hypothesis of a unit root, so
$\gamma = 1$, and no intercept, $\alpha_i = 0$, which corresponds to the pure time series case 2 in
Hamilton (1994, p. 490). Under this null hypothesis the unit root test statistic

$$\frac{\sqrt{N}\,(\hat{\gamma} - 1 + 3/(T + 1))}{\sqrt{3(17T^2 - 20T + 17)/[5(T - 1)(T + 1)^3]}} \xrightarrow{d} N[0, 1]$$

as $N \to \infty$ for fixed $T$. Large negative values of this statistic lead to rejection of the
unit root hypothesis. Levin and Lin (1992) provide additional tests, such as for models
with individual time trends.

Binder, Hsiao, and Pesaran (2003) consider short panel estimation of fixed effect
dynamic panel models with unit roots and cointegration. With unit roots the Arellano-Bond
estimator is inconsistent, though the extensions due to Ahn and Schmidt (1995)
and others discussed at the end of Section 22.5.3 yield consistent estimates. Binder
et al. (2003) propose quasi-ML estimators that perform better in finite samples when
unit roots are present.
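
The short-panel unit root statistic of Harris and Tzavalis (1999) described in this section is simple to compute once the within estimate is available. The sketch below is an illustration only: it assumes the centering $-3/(T + 1)$ and the variance $3(17T^2 - 20T + 17)/[5(T - 1)(T + 1)^3]$ for the fixed effects case, and the function name and example values are hypothetical.

```python
import math

def harris_tzavalis_z(gamma_hat, N, T):
    """Standardized statistic for H0: gamma = 1 in a short AR(1) panel
    with fixed effects, given the within estimate gamma_hat."""
    center = 1.0 - 3.0 / (T + 1)     # approximate mean of gamma_hat under H0
    variance = 3.0 * (17 * T ** 2 - 20 * T + 17) / (5.0 * (T - 1) * (T + 1) ** 3)
    return math.sqrt(N) * (gamma_hat - center) / math.sqrt(variance)

# A within estimate equal to the null centering value gives z = 0.
z_null = harris_tzavalis_z(1.0 - 3.0 / 6.0, N=100, T=5)
# A smaller within estimate gives a negative z, which is evidence against a unit root.
z_low = harris_tzavalis_z(0.2, N=100, T=5)
print(z_null, z_low)
```

The statistic is compared with the lower tail of the standard normal distribution, so large negative values reject the unit root null.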


22.6. Difference-in-Differences Estimator 


The evaluation literature presented in Chapter 25 focuses on measuring the treatment 
effect, in the simplest case the impact or marginal effect of a single binary regressor 
that equals one if treatment occurs and equals zero if treatment does not occur. For 
example, interest may lie in measuring the effect on earnings of a policy change (the 
binary treatment) that alters tax rates or welfare eligibility or access to training for 
some individuals but not for others. 

In this section we relate one of the methods of Chapter 25 to panel methods. Specif- 
ically the treatment effect can be measured using standard panel data methods if panel 
data are available before and after the treatment and if not all individuals receive the 
treatment. Then the first-differences estimator for the fixed effects model reduces to 
a simple estimator called the differences-in-differences estimator, introduced in Sec- 
tion 3.4.2 and also studied in Section 25.5. The latter estimator has the advantage that 
it can also be used when repeated cross-section data rather than panel data are avail- 
able. However, it does rely on model assumptions that are often not made explicit. The 
treatment here follows Blundell and MaCurdy (2000). 


22.6.1. Fixed Effects with Binary Treatment

Let the binary regressor of interest be

$$D_{it} = \begin{cases} 1 & \text{if individual } i \text{ receives treatment in period } t, \\ 0 & \text{otherwise.} \end{cases} \tag{22.39}$$

Assume a fixed effects model for $y_{it}$ with

$$y_{it} = \phi D_{it} + \delta_t + \alpha_i + \varepsilon_{it}, \tag{22.40}$$

where $\delta_t$ is a time-specific fixed effect and $\alpha_i$ is an individual-specific fixed effect. As
noted in Section 21.2.1 this is equivalent to regression of $y_{it}$ on $D_{it}$ and a full set of
time dummies with the complication of individual-specific fixed effects. For simplicity
there are no other regressors.

The individual effects $\alpha_i$ can be eliminated by first differencing. Then

$$\Delta y_{it} = \phi\Delta D_{it} + (\delta_t - \delta_{t-1}) + \Delta\varepsilon_{it}. \tag{22.41}$$

The treatment effect $\phi$ can be consistently estimated by pooled OLS regression of $\Delta y_{it}$
on $\Delta D_{it}$ and a full set of time dummies.


22.6.2. Differences in Differences 


Now consider specialization to only two time periods. Furthermore, suppose treatment
occurs only in period 2, so that in period 1 $D_{i1} = 0$ for all individuals and in period 2
$D_{i2} = 1$ for the treated and $D_{i2} = 0$ for the nontreated. Then the subscript $t$ can be
dropped from (22.41) and

$$\Delta y_i = \phi D_i + \delta + v_i, \tag{22.42}$$

where $D_i$ is a binary treatment variable indicating whether or not the individual
received treatment.

The treatment effect can be estimated by OLS regression of $\Delta y$ on an intercept and
the binary regressor $D$. Define $\Delta\bar{y}^{tr}$ to denote the sample average of $\Delta y_i$ for the treated
($D_i = 1$) and $\Delta\bar{y}^{nt}$ to denote the sample average of $\Delta y_i$ for the nontreated ($D_i = 0$).
Then the OLS estimator reduces to

$$\hat{\phi} = \Delta\bar{y}^{tr} - \Delta\bar{y}^{nt}. \tag{22.43}$$


This estimator is called the differences-in-differences (DID) estimator, since one 
estimates the time difference for the treated and untreated groups and then takes the 
difference in the time differences. 
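
That OLS of $\Delta y_i$ on an intercept and $D_i$ reduces to a difference in group means can be verified directly. The sketch below uses simulated data with arbitrary illustrative values (200 individuals, the first 80 treated, a common time effect of 2 and a treatment effect of 1); only NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
D = (np.arange(N) < 80).astype(float)        # hypothetical treatment indicator
dy = 2.0 + 1.0 * D + rng.normal(size=N)      # Delta y_i = delta + phi * D_i + v_i

# OLS of Delta y on an intercept and D.
X = np.column_stack([np.ones(N), D])
delta_hat, phi_hat = np.linalg.lstsq(X, dy, rcond=None)[0]

# Difference in mean changes between treated and nontreated.
did = dy[D == 1].mean() - dy[D == 0].mean()
print(phi_hat, did)
```

The two numbers agree up to floating-point error, and the estimated intercept equals the mean change for the nontreated group.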

The estimator is appealing for its intuitive simplicity. Additionally, it can be
extended from panel data to the case where separate cross sections are available in the
two periods. In the second period compute the averages $\bar{y}_2^{tr}$ and $\bar{y}_2^{nt}$ for the treated and
untreated groups. Compute similar averages $\bar{y}_1^{tr}$ and $\bar{y}_1^{nt}$ in the first pretreatment period.
This assumes that it is possible to identify in the first period whether or not an
individual is eligible for treatment. This is easy if, for example, the treatment applies only to
women and data on gender are available. Then compute

$$\hat{\phi} = (\bar{y}_2^{tr} - \bar{y}_1^{tr}) - (\bar{y}_2^{nt} - \bar{y}_1^{nt}). \tag{22.44}$$

As an example, if average annual earnings for the group eligible for treatment equals
10,000 before treatment and 13,000 after treatment then $\bar{y}_2^{tr} - \bar{y}_1^{tr} = 3{,}000$. Similarly,
if average annual earnings for the group not eligible for treatment equals 15,000 before
treatment and 17,000 after treatment then $\bar{y}_2^{nt} - \bar{y}_1^{nt} = 2{,}000$. The DID estimate of the
treatment effect $\phi$ is then 3,000 − 2,000 = 1,000.
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22.6.3. Assumptions Underlying Differences in Differences 


The preceding formulation of the DID estimator makes explicit the underlying
assumptions for consistent estimation of $\phi$.

First, it is assumed that the time effects $\delta_t$ are common across treated and untreated
individuals. For example, time trends may differ by gender, in which case identifying
$\phi$ is problematic if treatment depends on gender. The common trends assumption is
needed if either panel or cross-section data are used.

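Second, if cross-section data are used then the composition of the treated and
untreated groups is assumed to be stable before and after the change. With panel data
differencing eliminates the fixed effects $\alpha_i$. With repeated cross-section data the original
model (22.40) implies that $\bar{y}_2^{tr} = \phi + \delta_2 + \bar{\alpha}_2^{tr} + \bar{\varepsilon}_2^{tr}$ and $\bar{y}_2^{nt} = \delta_2 + \bar{\alpha}_2^{nt} + \bar{\varepsilon}_2^{nt}$. Given
that treatment only occurs in the second period it follows that

$$\hat{\phi} = (\bar{y}_2^{tr} - \bar{y}_1^{tr}) - (\bar{y}_2^{nt} - \bar{y}_1^{nt}) = \phi + (\bar{\alpha}_2^{tr} - \bar{\alpha}_1^{tr}) - (\bar{\alpha}_2^{nt} - \bar{\alpha}_1^{nt}) + v,$$

where $v = (\bar{\varepsilon}_2^{tr} - \bar{\varepsilon}_1^{tr}) - (\bar{\varepsilon}_2^{nt} - \bar{\varepsilon}_1^{nt})$. Consistency of $\hat{\phi}$ in (22.44) occurs if
plim$(\bar{\alpha}_2^{tr} - \bar{\alpha}_1^{tr}) = 0$ and plim$(\bar{\alpha}_2^{nt} - \bar{\alpha}_1^{nt}) = 0$. This will happen if assignment to treatment is
random. However, often this is not the case.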
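
The role of stable composition can be illustrated with a small deterministic sketch. The Python snippet below builds the four group-period means implied by the model with the averaged error terms set to zero, lets the average individual effect of the treated group shift between periods, and shows that the DID estimate then differs from the true effect by exactly that shift. All numerical values and function names are hypothetical.

```python
def group_mean(phi, delta_t, alpha_bar, treated_in_period):
    """Group-period mean: phi * D + delta_t + average individual effect (no error term)."""
    return phi * treated_in_period + delta_t + alpha_bar


def did(y2_tr, y1_tr, y2_nt, y1_nt):
    """Differences-in-differences of four group-period means."""
    return (y2_tr - y1_tr) - (y2_nt - y1_nt)


phi = 1000.0                          # true treatment effect (assumed)
delta = {1: 0.0, 2: 2000.0}           # common time effects (assumed)
alpha_tr = {1: 10000.0, 2: 10500.0}   # treated-group composition shifts by 500
alpha_nt = {1: 15000.0, 2: 15000.0}   # nontreated-group composition is stable

y1_tr = group_mean(phi, delta[1], alpha_tr[1], 0)
y2_tr = group_mean(phi, delta[2], alpha_tr[2], 1)
y1_nt = group_mean(phi, delta[1], alpha_nt[1], 0)
y2_nt = group_mean(phi, delta[2], alpha_nt[2], 0)

phi_hat = did(y2_tr, y1_tr, y2_nt, y1_nt)
bias = phi_hat - phi
print(phi_hat, bias)
```

Setting both entries of `alpha_tr` to the same value removes the bias, which corresponds to the stable-composition condition stated above.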


22.6.4. Richer Models 


In practice richer models are used. An obvious extension is to include regressors 
other than the treatment indicator and time dummies. By grouping data the individual- 
specific effects can at least be permitted to differ on average across groups. The general 
procedure is to estimate 


$$y_{igt} = \phi D_{igt} + \delta_t + \alpha_g + \varepsilon_{igt},$$


where $g$ denotes the $g$th group.

In a classic example of DID estimation, Card (1990) studied the effect on unemploy- 
ment of low-wage workers in Miami of a sudden influx of immigrants from Cuba. This 
example is also reviewed in Angrist and Krueger (1999). Athey and Imbens (2002)
present an extension to nonlinear models.


22.7. Repeated Cross Sections and Pseudo Panels 


The key potential advantages of panel data arise from being able to observe subjects 
over time. This makes it possible to control for unobserved individual heterogeneity, 
differences in initial conditions, and dynamic dependence of outcomes. In many cases, 
however, genuine panel data are unavailable. 


22.7.1. Repeated Cross Sections 


We consider analysis when data are for several repeated cross sections, derived from 
responses to a series of independent sample surveys, where independence means that
each subject appears in only one survey. An example is the U.K. Family Expenditure
Survey, which collects a large annual sample of household expenditure data but each 
year surveys different families. Also, if only a very short panel is available (e.g., T = 
2) then data from repeated cross sections are appealing if they can generate a larger 
and richer sample. 

For a random effects model repeated cross-section data pose no challenges. One
simply performs a pooled regression of $y_{it}$ on $x_{it}$ (see Section 21.5) and statistical
inference is actually simplified as correction is needed only for heteroskedasticity since
here errors are independent over both $i$ and $t$.

With fixed effects, however, pooled regression leads to inconsistent parameter es- 
timates. Furthermore, alternative methods such as the within or first-differences es- 
timation are infeasible if individuals are observed at only one point in time. In this 
section repeated cross-section data are used to construct pseudo panels or synthetic 
panel data that have some of the advantages of genuine panel data, most notably the 
ability to control for fixed effects. A special case is the DID estimator presented in 
Section 22.6. 


22.7.2. Pseudo Panels 


Browning, Deaton, and Irish (1985) and Deaton (1985), in their empirical studies 
based on the U.K. Family Expenditure Survey, considered methods for analyzing re- 
peated cross-section data. Their suggestion was to convert the individual-level data 
into cohort-level data. Although individual household expenditures cannot be tracked 
through time, it is possible to do so for cohorts of individuals. 

A cohort is defined as “a group with fixed membership, individuals of which can 
be identified as they show up in the surveys” (Deaton, 1985, p. 109). An example is an 
age cohort such as males born between 1965 and 1970. For large samples, successive 
surveys will generate random samples of members of each cohort. 
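
A rough sketch of how individual-level records from repeated cross sections can be collapsed into cohort-time averages is given below. The record layout (survey year, birth year, outcome), the five-year birth bands starting in 1965, and the function names are assumptions made for illustration; they are not taken from the studies cited.

```python
from collections import defaultdict


def cohort_of(birth_year, first_year=1965, width=5):
    """Assign a cohort index from birth year using fixed-width birth-year bands."""
    return (birth_year - first_year) // width


def cohort_time_means(records):
    """Average the outcome within each (cohort, survey year) cell."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for survey_year, birth_year, y in records:
        key = (cohort_of(birth_year), survey_year)
        totals[key] += y
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Hypothetical records: different individuals are sampled in each survey year.
records = [
    (1990, 1966, 10.0), (1990, 1968, 14.0), (1990, 1972, 20.0),
    (1991, 1965, 13.0), (1991, 1969, 15.0), (1991, 1971, 22.0),
]

means = cohort_time_means(records)
print(means)
```

Although no individual appears twice, each cohort is observed in every survey year, so the cell means form a panel at the cohort level.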

Time series of sample averages of cohorts can form the basis of regression models. 
Whether synthetic panels based on cohort data can substitute for genuine panel data 
is a key issue. The topic of repeated cross section deals with inference procedures for 
such models. Here we focus on static pseudo panel models. Collado (1997) and Girma 
(2000) also consider the dynamic case. 

The starting point is the static linear regression with individual fixed effects $\alpha_i$,
based on $T$ successive cross sections,

$$y_{it} = \alpha_i + x_{it}'\beta + u_{it}, \quad i = 1, \ldots, N, \quad t = 1, \ldots, T.$$ (22.45)

The explanatory variables are assumed to be strongly exogenous with respect to
parameters of interest, $\beta$, so $E[x_{it}' u_{is}] = 0$, $\forall\, t, s$. For simplicity, we assume that $N$
observations are available for each cross section. Each individual is observed in only one
time period, so the individual-specific effects $\alpha_i$ cannot be swept out by differencing
the individual-level data.

Let $g_i$ be a random variable that determines cohort membership for each $i$, such that
$i$ belongs to cohort $c$ if and only if $g_i$ belongs to the set $I_c$. Assume that there are $C$
cohorts, and $c$ is the cohort subscript, $c = 1, \ldots, C$. Taking expectations conditional
on $g_i$ yields

$$E[y_{it} | g_i \in I_c] = E[\alpha_i | g_i \in I_c] + E[x_{it}' | g_i \in I_c]\beta + E[u_{it} | g_i \in I_c].$$ (22.46)

This generates a cohort population version of the model (22.45) given by

$$y_{ct}^* = \alpha_c^* + x_{ct}^{*\prime}\beta + u_{ct}^*,$$ (22.47)

where the asterisks denote unobservable population cohort averages. For example,
$y_{ct}^* = E[y_{it} | g_i \in I_c]$.

The parameter $\alpha_c^* = E[\alpha_i | g_i \in I_c]$ is the cohort fixed effect. An important
assumption made in the case of fixed effects is that the population is stationary so that $\alpha_c^*$ can
be assumed to be constant over time. This is qualitatively similar to the assumption
needed for consistency of the DID estimator made at the end of Section 22.6.3. Under
the usual weak exogeneity assumptions $E[u_{ct}^* | x_{ct}^*] = 0$. However, the unobserved fixed
effect $\alpha_c^*$ will be correlated with $x_{ct}^*$ if $\alpha_i$ is correlated with $x_{it}$ in the original model
(22.45). Estimation needs to control for the fixed effect.

In practice the population cohort means are unobservable and we instead work with
cohort-time averages $\bar{y}_{ct}$ and $\bar{x}_{ct}$. The regression is then

$$\bar{y}_{ct} = \bar{\alpha}_c + \bar{x}_{ct}'\beta + \bar{u}_{ct}, \quad c = 1, \ldots, C, \quad t = 1, \ldots, T.$$ (22.48)


This step introduces an additional source of error, since $\bar{y}_{ct}$ and $\bar{x}_{ct}$ are
error-contaminated estimates of the population cohort averages, that is,

$$\bar{y}_{ct} = y_{ct}^* + \varepsilon_{ct},$$ (22.49)
$$\bar{x}_{ct} = x_{ct}^* + v_{ct}.$$


If the measurement error is very small, owing to the number of observations per
cohort per time period ($N_{ct}$) being very large, then $\bar{y}_{ct} \simeq y_{ct}^*$ and $\bar{x}_{ct} \simeq x_{ct}^*$ and the
measurement error can be ignored. A consistent estimate of $\beta$ can be obtained by
within estimation of (22.48), that is, OLS regression of $(\bar{y}_{ct} - \bar{y}_c)$ on $(\bar{x}_{ct} - \bar{x}_c)$, where
$\bar{y}_c = T^{-1}\sum_t \bar{y}_{ct}$ and $\bar{x}_c = T^{-1}\sum_t \bar{x}_{ct}$.
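
The within estimator applied to cohort means can be sketched as follows for a single regressor. The snippet assumes the cohort-time averages are already available as nested lists indexed by cohort and then period, and it uses made-up, error-free data generated with a slope of 2 and cohort effects of 1 and 5; the function name is hypothetical.

```python
def within_estimator(y_bar, x_bar):
    """OLS of cohort-demeaned y on cohort-demeaned x (scalar regressor).

    y_bar[c][t] and x_bar[c][t] hold the cohort-time averages.
    """
    numerator = 0.0
    denominator = 0.0
    for y_c, x_c in zip(y_bar, x_bar):
        y_mean = sum(y_c) / len(y_c)
        x_mean = sum(x_c) / len(x_c)
        for y, x in zip(y_c, x_c):
            numerator += (x - x_mean) * (y - y_mean)
            denominator += (x - x_mean) ** 2
    return numerator / denominator


# Two cohorts, three periods: y = cohort effect + 2 * x, with effects 1 and 5.
x_bar = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
y_bar = [[3.0, 5.0, 7.0], [9.0, 13.0, 17.0]]

beta_hat = within_estimator(y_bar, x_bar)
print(beta_hat)
```

Demeaning within each cohort removes the cohort effect, so the slope is recovered even though the cohort effects are correlated with the regressor in this example. The sketch ignores the sampling error in the cohort means.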

Unfortunately, the measurement error is often too large to ignore. Then within
estimation of (22.48), or even OLS estimation of (22.48) when $\bar{\alpha}_c$ is a random effect,
leads to inconsistent estimation of $\beta$. Instead, errors-in-variables estimators need to be
used. These can be implemented here since the individual-level data yield necessary
estimates of the moments of the measurement error; see Section 26.3.3.


22.7.3. Measurement Error Estimators for Pseudo Panels 


A classic solution to measurement errors is to use replicated observations to estimate 
the covariance matrix of the measurement error, and to then use these estimates to 
“correct” the sample moments of the error-contaminated variables before applying 
the least-squares procedure (see Section 26.3.4). Deaton (1985) proposed using this 
method in the current setting.

Assume that individual observations satisfy the equations

$$y_{it} = y_{ct}^* + \varepsilon_{it},$$
$$x_{it} = x_{ct}^* + \eta_{it},$$

a setup similar to that in Section 26.2.1, except that there is also measurement error in
the dependent variable, and assume that for any individual in a given cohort $c$,

$$\begin{bmatrix} \varepsilon_{it} \\ \eta_{it} \end{bmatrix} \sim \left[ \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \sigma_{00} & \sigma_{01}' \\ \sigma_{01} & \Sigma \end{bmatrix} \right].$$

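Sample estimates of $(\Sigma, \sigma_{01})$, denoted $(\hat{\Sigma}, \hat{\sigma}_{01})$, can be obtained given $(\bar{y}_{ct}, \bar{x}_{ct})$ from
using all individual-level data. Define $d_c$ to be the $C \times 1$ column vector of dummy
variables corresponding to the fixed effects $\alpha_c^*$ (see Section 21.2.1), which is a
regressor vector that is clearly not subject to estimation error. Then provided $T$ is sufficiently
large and the relevant inverses exist, the regression

$$\begin{bmatrix} \hat{\alpha} \\ \hat{\beta} \end{bmatrix} = \begin{bmatrix} \frac{1}{CT}\sum_{c,t} d_c d_c' & \frac{1}{CT}\sum_{c,t} d_c \bar{x}_{ct}' \\ \frac{1}{CT}\sum_{c,t} \bar{x}_{ct} d_c' & \frac{1}{CT}\sum_{c,t} \bar{x}_{ct}\bar{x}_{ct}' - \hat{\Sigma} \end{bmatrix}^{-1} \begin{bmatrix} \frac{1}{CT}\sum_{c,t} d_c \bar{y}_{ct} \\ \frac{1}{CT}\sum_{c,t} \bar{x}_{ct}\bar{y}_{ct} - \hat{\sigma}_{01} \end{bmatrix}$$ (22.50)

will provide consistent estimates of the cohort regression as $CT \to \infty$. This estimator
is the same as that given in Section 26.3.4, with adaptation here because $\bar{y}_{ct}$ is also
measured with error and with simplification because only a subset of the regressors, $\bar{x}_{ct}$, is
measured with error. Verbeek and Nijman (1992) provide a more detailed discussion
of the sampling properties, and Deaton (1985) presents variance estimation. See also
Verbeek (1995).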

The preceding estimator essentially controls for the cohort fixed effects by estimat- 
ing the least-squares dummy variable model, adjusting for measurement error by use 
of replicated data using the estimator given in Section 26.3.4. 

Collado (1997) considered an alternative approach of eliminating the cohort effects 
by first differencing, and then controlling for measurement error through instrumental 
variables estimation, an alternative identification strategy for measurement error given 
in Section 26.3.2. 

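Substituting (22.49) into (22.47) gives

$$\bar{y}_{ct} - \varepsilon_{ct} = \alpha_c^* + (\bar{x}_{ct} - v_{ct})'\beta + u_{ct}^*,$$
$$\bar{y}_{ct} = \alpha_c^* + \bar{x}_{ct}'\beta + w_{ct},$$

where the error $w_{ct} = u_{ct}^* - v_{ct}'\beta + \varepsilon_{ct}$. First differencing eliminates $\alpha_c^*$, leading to

$$\Delta\bar{y}_{ct} = \Delta\bar{x}_{ct}'\beta + \Delta w_{ct}, \quad t = 2, \ldots, T.$$ (22.51)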


Now because of the measurement error terms the explanatory variables $\Delta\bar{x}_{ct}$ will be
correlated with $\Delta w_{ct}$, and hence applying least squares will lead to inconsistent
estimation. Consistent estimates can be obtained by IV estimation based on lagged levels
of exogenous variables, that is, $\bar{x}_{c,t-2}$. This approach has the advantage of ready
extension to models with lagged dependent variables. For details see Collado (1997).


22.8. Mixed Linear Models


The model called the random effects model by econometricians specifies only the in- 
tercept coefficient to be random. Richer random effects models, widely used in other 
areas of applied statistics, additionally permit the slope parameters to be random. In 
this section we present mixed linear models — also called mixed effects models, hierar- 
chical, or multilevel linear models (see Chapter 24), random coefficients models, and 
variance components models. 

These models are applied in a setting where the pooled OLS estimator is still con- 
sistent. In particular, there are no fixed effects. Because the mixed linear models frame- 
work provides enough structure to permit estimation by feasible GLS, its estimates are 
more efficient. 


22.8.1. Mixed Linear Models 


The mixed linear model specifies

$$y_{it} = z_{it}'\beta + w_{it}'\alpha_i + \varepsilon_{it},$$ (22.52)

where the regressors $z_{it}$ include an intercept, $w_{it}$ is a vector of observable
characteristics, $\alpha_i$ is a random zero-mean vector, and $\varepsilon_{it}$ is an error term. This model is called a
mixed model as it has both fixed parameters $\beta$ and zero-mean random parameters
or random effects $\alpha_i$.

The random intercept model $y_{it} = z_{it}'\beta + \alpha_i + \varepsilon_{it}$ is a special case of (22.52) with
$w_{it}'\alpha_i = \alpha_i$.

Another special case of (22.52) is the random coefficients model or random
parameters model. In the regression setting we suppose that

$$y_{it} = z_{it}'\beta_i + \varepsilon_{it},$$

a regular linear regression, except that the regression parameter vector now differs
across individuals according to

$$\beta_i = \beta + \alpha_i,$$

where $\alpha_i$ is a zero-mean random vector. Substitution yields $y_{it} = z_{it}'\beta + z_{it}'\alpha_i + \varepsilon_{it}$,
which is (22.52) with $w_{it} = z_{it}$.

Many applications lie between random intercept and random coefficients models,
with $w_{it}$ often a subset of $z_{it}$. In particular, standard mixed and random ANOVA
models are also a special case, where the $k$th component of the vector $w_{it}$ is either zero or
one, according to various possible models for clustering the data. For example, one of
the components in $z_{it}$ may be a race or gender indicator variable. Then the conditional
mean of $y_{it}$ varies with gender or race. It may also be felt that the conditional variance
of $y_{it}$ also varies with gender or race, which can be captured by inclusion in $w_{it}$. The
mixed model is an outgrowth of ANOVA models. The hierarchical linear model or
multi-level linear model (see Section 24.6.2) can also be expressed as a special case
of (22.52).


22.8.2. Estimation


The goal is to estimate the fixed regression parameters $\beta$ and the variances and
covariance parameters of the distributions for $\alpha_i$ and $\varepsilon_{it}$. One of the early treatments of
this model was in a Bayesian context by Lindley and Smith (1972). A simple example
of their general treatment was the random coefficients model with $y_{it} \sim N[z_{it}'\beta_i, \sigma^2]$,
where $\beta_i \sim N[\gamma, \Gamma]$. See Koop (2003), for example, for Bayesian analysis of the
linear panel data model.

Here we follow the classical approach, based on the work of Harville (1977), who
gives references to the earlier literature. The mixed model (22.52) can be split into
a deterministic component $z_{it}'\beta$ and a random component $w_{it}'\alpha_i + \varepsilon_{it}$. The stochastic
assumptions include the assumption that the regressors $z_{it}$ are independent of the
zero-mean random components $\alpha_i$ and $\varepsilon_{it}$. So pooled OLS regression of $y_{it}$ on $z_{it}$ provides
consistent estimates of $\beta$. We are essentially in the world of Section 21.5, with feasible
GLS estimation possible as structure has been placed on the variance matrix of the
error term $w_{it}'\alpha_i + \varepsilon_{it}$. In this section we present the feasible GLS estimator along
with two different methods to estimate the variances and covariances of $\alpha_i$ and $\varepsilon_{it}$ and
consider prediction of the random components $\alpha_i$.

Combine observations over time for a given individual in the usual way, so that
(22.52) becomes

$$y_i = Z_i\beta + (W_i\alpha_i + \varepsilon_i).$$ (22.53)


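The usual assumptions are that $\alpha_i$ and $\varepsilon_i$ are independent over $i$ and independent of
each other with $\alpha_i \sim [0, \Sigma_\alpha]$ and $\varepsilon_i \sim [0, \Sigma_\varepsilon]$, so that the error term

$$W_i\alpha_i + \varepsilon_i \sim [0, \Omega_i = W_i\Sigma_\alpha W_i' + \Sigma_\varepsilon].$$

Then the feasible GLS estimator is

$$\hat{\beta}_{FGLS} = \left[\sum_{i=1}^{N} Z_i'\hat{\Omega}_i^{-1}Z_i\right]^{-1} \sum_{i=1}^{N} Z_i'\hat{\Omega}_i^{-1}y_i,$$ (22.54)

where $\hat{\Omega}_i$ is consistent for $\Omega_i$.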
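
As a small illustration of the error variance matrix that enters (22.54), the snippet below forms the matrix for one individual from the design of the random effects, the random-effects variance, and the idiosyncratic error variance, using plain Python lists. The random-intercept example with three periods, the variance values, and the helper names are assumptions made for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*a)]


def error_covariance(W, sigma_alpha, sigma_eps):
    """Return W * Sigma_alpha * W' + Sigma_eps for one individual."""
    wsw = matmul(matmul(W, sigma_alpha), transpose(W))
    T = len(W)
    return [[wsw[s][t] + sigma_eps[s][t] for t in range(T)] for s in range(T)]


# Random-intercept special case: W is a column of ones (T = 3).
T = 3
W_i = [[1.0] for _ in range(T)]
sigma_alpha = [[2.0]]  # variance of the random intercept (assumed)
sigma_eps = [[0.5 if s == t else 0.0 for t in range(T)] for s in range(T)]

omega_i = error_covariance(W_i, sigma_alpha, sigma_eps)
for row in omega_i:
    print(row)
```

In this special case the result is the familiar equicorrelated structure, with the sum of the two variances on the diagonal and the random-intercept variance off the diagonal.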
Implementation requires consistent estimation of $\Omega_i$. This has already been
discussed in Section 21.7 for the simpler case of a random intercept, in which case there
were several different ways to consistently estimate the variance components $\sigma_\alpha^2$ and
$\sigma_\varepsilon^2$, with complications such as bias and the possibility of negative estimates. Similar
issues arise here in estimation of $\Sigma_\alpha$ and $\Sigma_\varepsilon$.

We present two estimators based on the additional assumption of normal
distribution for the random components. The presentation is for the more general model

$$y = Z\beta + (W\alpha + \varepsilon),$$ (22.55)

which can be obtained, for example, by appropriate stacking of (22.53). It is assumed
that $\alpha \sim N[0, G]$ and $\varepsilon \sim N[0, R]$, where in the current application $G$ and $R$ are
functions of $\Sigma_\alpha$ and $\Sigma_\varepsilon$. The feasible GLS estimator for the mixed model is

$$\hat{\beta}_{FGLS} = [Z'\hat{V}^{-1}Z]^{-1}Z'\hat{V}^{-1}y,$$
where $\hat{V}$ is consistent for $V = V[W\alpha + \varepsilon] = WGW' + R$. See Swamy (1970).

The obvious method for obtaining $\hat{V}$ is maximum likelihood. The log-likelihood
function based on the multivariate normal, after concentrating out $\beta$ which is equal to
the GLS estimator $[Z'V^{-1}Z]^{-1}Z'V^{-1}y$, is

$$\ln L(G, R) = -\frac{1}{2}\ln|V| - \frac{NT}{2}\ln(r'V^{-1}r) - \frac{NT}{2}\left[1 + \ln\left(\frac{2\pi}{NT}\right)\right],$$

where $r = y - Z[Z'V^{-1}Z]^{-1}Z'V^{-1}y$ and $|V|$ denotes the determinant of $V$.
Maximization with respect to the parameters in $G$ and $R$ yields $\hat{V} = W\hat{G}W' + \hat{R}$.

A weakness of ML estimates of variance components is that they are biased in
small samples. For example, for cross-section linear regression with homoskedastic
errors the MLE $\hat{\sigma}^2 = N^{-1}\sum_i \hat{u}_i^2$ is biased and it is better to instead divide by $(N - K)$.
For the model (22.53), degree-of-freedom corrections are provided by the restricted
maximum likelihood estimator that instead maximizes

$$\ln L_R(G, R) = -\frac{1}{2}\ln|V| - \frac{NT - p}{2}\ln(r'V^{-1}r) - \frac{NT - p}{2}\left[1 + \ln\left(\frac{2\pi}{NT - p}\right)\right] - \frac{1}{2}\ln|Z'V^{-1}Z|,$$

where $p$ is the rank of $Z$. For motivation of $\ln L_R(G, R)$ see Harville (1977).

As an empirical example of a mixed linear model, consider the ln(hours)-ln(wage)
regression example of Section 21.3 with both the intercept and slope parameters
permitted to be random. Then the random coefficients model yields lnhrs = 7.734 - 0.021 lnwg
with slope coefficient standard error of 0.046 (default) or 0.020 (panel bootstrap).
The slope coefficient is quite different from the estimates given in Table 21.2.


22.8.3. Prediction 


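We may wish to predict the random parameters $\alpha$ in addition to estimating the fixed
parameters $\beta$ and the covariance parameters.

The joint normal equations for $\beta$ and $\alpha$, given consistent estimates $\hat{G}$ and $\hat{R}$, can
be written as

$$\begin{bmatrix} Z'\hat{R}^{-1}Z & Z'\hat{R}^{-1}W \\ W'\hat{R}^{-1}Z & W'\hat{R}^{-1}W + \hat{G}^{-1} \end{bmatrix} \begin{bmatrix} \hat{\beta} \\ \hat{\alpha} \end{bmatrix} = \begin{bmatrix} Z'\hat{R}^{-1}y \\ W'\hat{R}^{-1}y \end{bmatrix}.$$

Solving for $\hat{\beta}$ gives $\hat{\beta}_{FGLS}$ given earlier, whereas

$$\hat{\alpha} = \hat{G}W'\hat{V}^{-1}(y - Z\hat{\beta}).$$

In the case of independence over $i$, this yields $\hat{\alpha}_i = \hat{\Sigma}_\alpha W_i'\hat{\Omega}_i^{-1}(y_i - Z_i\hat{\beta})$. This is the
best linear unbiased predictor if the variance matrices are known.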
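
For intuition, the predictor can be sketched in the random-intercept special case, assuming additionally that the idiosyncratic errors are homoskedastic and uncorrelated over time. Under those assumptions the matrix formula collapses to a shrinkage of the individual's mean residual toward zero, with shrinkage factor equal to T times the random-intercept variance divided by the sum of that quantity and the error variance. This simplification is a standard result that is not derived in the text, and the numbers and function name below are hypothetical.

```python
def blup_random_intercept(residuals, var_alpha, var_eps):
    """Predict the random intercept for one individual.

    residuals: y_it minus the fitted fixed part, for t = 1..T.
    Assumes a scalar random intercept and i.i.d. idiosyncratic errors.
    """
    T = len(residuals)
    shrinkage = T * var_alpha / (T * var_alpha + var_eps)
    return shrinkage * sum(residuals) / T


# Hypothetical residuals for one individual over three periods.
residuals = [1.0, 2.0, 3.0]
alpha_hat = blup_random_intercept(residuals, var_alpha=2.0, var_eps=0.5)
print(alpha_hat)
```

The prediction is smaller in magnitude than the raw mean residual, and the shrinkage weakens as the number of periods or the random-intercept variance grows.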


22.9. Practical Considerations 


The panel 2SLS estimators can actually be estimated using just a 2SLS program for
cross-section data (see Section 22.2.5) though computed standard errors need to be
panel robust. Optimal GMM estimators can be implemented using matrix commands
in a statistical package or in a programming language such as GAUSS. Statistical
packages are increasingly offering panel commands that automatically implement the
estimators of this chapter, most notably the Arellano-Bond estimator.


22.10. Bibliographic Notes 


This chapter covers an active area of research that appears in several recent texts devoted to 
panel data, notably those by Baltagi (1995, 2001), Hsiao (1986, 2003), M-J. Lee (2002), and 
Arellano (2003). More advanced methods are given in Matyas and Sevestre (1995) and in Arel- 
lano and Honore (2001). 


22.2 Chamberlain (1982, 1984) emphasized the use of exogeneity assumptions. He used min- 
imum distance estimation. The subsequent literature has used GMM methods. M-J. Lee 
(2002) and Arellano (2003) especially emphasize GMM estimation. See also the survey 
by Ahn and Schmidt (1999). 

22.4 The model of Hausman and Taylor (1981) is attractive. By assuming that some regressors 
are uncorrelated with the individual-specific effect it permits identification of the coeffi- 
cients of time-invariant regressors. 

22.5 The coverage of linear dynamic models is very brief compared to the size of the literature 
that began with Balestra and Nerlove (1966). More complete discussions are given in 
Baltagi (2001, Chapter 8), Hsiao (2003, Chapter 4), and Arellano (2003, Chapters 5-8).
The Arellano-Bond (1991) estimator is especially popular as it accommodates dynamic
models with fixed effects. 

22.6 The difference-in-differences approach is very popular because of its simplicity. Although 
it can be used with repeated cross-section rather than panel data, a panel data interpreta- 
tion helps make explicit the underlying assumptions. Bertrand et al. (2004) demonstrate 
the importance of correcting for time series correlation at the individual level using the 
methods of Section 22.2.3. 

22.8 Mixed linear models are especially popular in the statistics literature. They are less used 
in the econometrics literature, because of the reluctance to impose structure on the time- 
invariant individual-specific fixed effect. 


Exercises 


22-1 Consider the panel GMM estimator of Section 22.2.1.

(a) Show that minimization with respect to $\beta$ of the quadratic function $Q_N(\beta)$
given after (22.3) yields the panel GMM estimator given after $Q_N(\beta)$ that is
expressed using summation notation.

(b) Show that this estimator is equivalent to the estimator defined in (22.4).

(c) For simplicity suppose that the matrices $Z$ and $X$ in (22.4) are nonstochastic
and that $y = X\beta + u$ where $u$ has mean 0 and variance $\Omega$. Obtain the finite
sample variance matrix of the estimator in (22.4) and compare this to the
asymptotic results in (22.5).

(d) Simplify the panel GMM estimator in the case that $r = K$.


22-2 Consider the panel data model $y_{it} = \alpha + \beta x_{it} + \gamma w_{it} + u_{it}$, $i = 1, \ldots, N$,
$t = 1, \ldots, T$, where for simplicity there is no individual-specific effect. Suppose the
scalar regressor $x_{it}$ is correlated with $u_{is}$ for all $t$ and $s$. For each of the following
parts state whether consistent IV estimation of $\beta$ and $\gamma$ is possible, and if so give
all the suitable instruments, based on the discussion in Section 22.2. Assume
that three periods of data are available, so $T = 3$, and note that a variable may
not be available as an instrument in all years, and that in different years different
instruments may be available.

(a) The regressor $w_{it}$ satisfies the summation assumption $E[\sum_t w_{it}u_{it}] = 0$.

(b) The regressor $w_{it}$ satisfies the contemporaneous exogeneity assumption
$E[w_{it}u_{it}] = 0$, $t = 1, \ldots, 3$.

(c) The regressor $w_{it}$ satisfies the weak exogeneity assumption $E[w_{is}u_{it}] = 0$,
$s \leq t$, $t = 1, \ldots, 3$.

(d) The regressor $w_{it}$ satisfies the strong exogeneity assumption $E[w_{is}u_{it}] = 0$,
$s, t = 1, \ldots, 3$.

22-3 Repeat Exercise 22-2, again with three periods of data, but now consider the panel
data model $y_{it} = \alpha_i + \beta x_{it} + \gamma w_{it} + u_{it}$, where $\alpha_i$ is a fixed effect, and consider
IV estimation based on the first differences model, $y_{it} - y_{i,t-1} = \beta(x_{it} - x_{i,t-1}) +
\gamma(w_{it} - w_{i,t-1}) + (u_{it} - u_{i,t-1})$.

22-4 Consider the differences in differences (DID) estimator presented in
Section 22.6. Suppose the time trend term $(\delta_t - \delta_{t-1})$ differs across the treated and
untreated groups.

(a) Will the DID estimator of $\phi$ based on repeated cross-section data be
consistent? Explain your answer.

(b) Is consistent estimation of $\phi$ possible if panel data are available? Explain
your answer.

22-5 Using the hours and wages data of Ziliak (1997) reproduce as much of
Table 22.2 as you can, with appropriate discussion, when the instrument set is
expanded to include the third lags of lnwg, kids, age, agesq, and disab and the
seven years 1982-88 are used to estimate (22.22).


CHAPTER 23


Nonlinear Panel Models 


23.1. Introduction 


This chapter extends the linear model panel data methods of Chapters 21 and 22 to the
nonlinear regression models presented in Chapters 14-20. We focus on short panels
and models with a time-invariant individual-specific effect that may be fixed or may 
be random. Both static and dynamic models are considered. 

There is no one-size-fits-all prescription for nonlinear models with individual spe- 
cific effects. If individual-specific effects are fixed and the panel is short then consistent 
estimation of the slope parameters is possible for only a subset of nonlinear models. 
If individual-specific effects are instead purely random then consistent estimation is 
possible for a wide range of models. 

Section 23.2 presents general approaches that may or may not be implementable for
particular models. Section 23.3 provides an application to a nonlinear model with
multiplicative individual-specific effects. Specializations to the leading classes of
nonlinear models (discrete data, selection models, transition data, and count data) are
presented in Sections 23.4-23.7. Semiparametric estimation is surveyed in Section 23.8.


23.2. General Results 


General approaches to extending the methods for linear models are presented in this 
section. We first present the various models — fixed effects, random effects, and pooled 
models, distinguishing parametric from conditional mean models. Methods to estimate 
these models and obtain panel-robust standard errors are then presented. Further details 
for specific nonlinear panel models are provided in subsequent sections. 


23.2.1. Individual-Specific Effects Models 


The linear individual-specific effects model (see Section 21.2.1) specifies that the
dependent variable $y_{it}$ depends on a time-invariant individual-specific effect $\alpha_i$, as
well as the usual regressors $x_{it}$ and regression parameters $\beta$. The model is written as
$y_{it} = \alpha_i + x_{it}'\beta + u_{it}$, where $u_{it}$ is an error term.

For nonlinear models such as logit and Poisson models there is less motivation
for introducing an additive error $u_{it}$. Instead, it is more natural to directly model the
conditional density, or the conditional mean, which in the linear case can be expressed
as $E[y_{it} | \alpha_i, x_{it}] = \alpha_i + x_{it}'\beta$.


Parametric Models 


A fully parametric approach is common for many nonlinear models, most notably 
models for binary, multinomial, and censored outcomes given in Chapters 14-16. 

The standard cross-section models are single-index models, or single-index models
with additional scale parameter(s). The parametric individual-specific effects
models presented in subsequent sections specify conditional density

$$f(y_{it} | \alpha_i, x_{it}) = f(y_{it}, \alpha_i + x_{it}'\beta, \gamma),$$ (23.1)

where $\gamma$ denotes additional parameters such as variance parameters. The model is a
single-index model in the regressors $x_{it}$ and the individual effects $\alpha_i$.

The usual assumption is that $y_{it} | x_{it}, \alpha_i$ is independent over both $i$ and $t$. This can
be relaxed to permit dependence over $t$ for given $i$ (see Section 23.2.6).


Conditional Mean Models 


A quite general nonlinear model for the conditional mean, with unobserved
time-invariant individual-specific effect, is

$$E[y_{it} | \alpha_i, x_{it}] = g(\alpha_i, x_{it}, \beta), \quad i = 1, \ldots, N, \quad t = 1, \ldots, T,$$ (23.2)

for given function $g(\cdot)$. Three common specifications are an additive
individual-specific effects model

$$g(\alpha_i, x_{it}, \beta) = \alpha_i + g(x_{it}, \beta),$$ (23.3)

a multiplicative individual-specific effects model,

$$g(\alpha_i, x_{it}, \beta) = \alpha_i g(x_{it}, \beta),$$ (23.4)

and a single-index individual-specific effects model

$$g(\alpha_i, x_{it}, \beta) = g(\alpha_i + x_{it}'\beta).$$ (23.5)

In each case the function $g(\cdot)$ is specified. The regressors $x_{it}$ may be time varying or
time-invariant and may include a time dummy.

The additive effects model is suited to applications where the range of $y_{it}$ is
unbounded, as implicitly assumed with linear regression. The multiplicative effects
model is suited to applications where $y_{it}$ is nonnegative unbounded, such as count data,
in which case $\alpha_i > 0$ and $g(\cdot) > 0$. The single-index model is a natural starting point
for the probit model, for example, with $g(\alpha_i + x_{it}'\beta) = \Phi(\alpha_i + x_{it}'\beta)$, where $\Phi(\cdot)$ is the
standard normal cdf. The single-index model reduces to the additive model if $g(\cdot)$ is
the identity function. It reduces to the multiplicative model if $g(\cdot)$ is the exponential
function, since then $\exp(\alpha_i + x_{it}'\beta) = \exp(\alpha_i)\exp(x_{it}'\beta)$.
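
The relationship among the three specifications can be checked numerically with a short sketch. In the snippet below the linear index is passed in as a single number, the additive model is given a linear regression function, and the multiplicative model an exponential one; these functional-form choices, the numerical values, and the function names are assumptions made for illustration.

```python
import math


def additive_mean(alpha, xb):
    """Additive effect, with the regression function taken to be the linear index."""
    return alpha + xb


def multiplicative_mean(alpha, xb):
    """Multiplicative effect, with the regression function taken to be exp(index)."""
    return alpha * math.exp(xb)


def single_index_mean(alpha, xb, g):
    """Single-index model: g applied to the sum of the effect and the index."""
    return g(alpha + xb)


def std_normal_cdf(z):
    """Standard normal cdf, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


alpha, xb = 0.3, 0.7  # hypothetical individual effect and linear index

identity_case = single_index_mean(alpha, xb, lambda z: z)
exp_case = single_index_mean(alpha, xb, math.exp)
probit_case = single_index_mean(alpha, xb, std_normal_cdf)
print(identity_case, exp_case, probit_case)
```

With the identity function the single-index value equals the additive one, and with the exponential function it equals the multiplicative one once the effect is re-expressed as its exponential.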

The moment condition (23.2) conditions only on current period $x_{it}$ and assumes that
regressors are contemporaneously exogenous (see Section 22.2.4). Elimination of the
individual-specific effects $\alpha_i$ can require stronger exogeneity assumptions. Regressors
are weakly exogenous if

$$E[y_{it} | \alpha_i, x_{i1}, \ldots, x_{it}] = g(\alpha_i, x_{it}, \beta)$$ (23.6)

and strongly exogenous or strictly exogenous if

$$E[y_{it} | \alpha_i, x_{i1}, \ldots, x_{iT}] = g(\alpha_i, x_{it}, \beta).$$ (23.7)

A nonlinear model with additive effects adds relatively few complications. In
particular, if the panel model is $y_{it} = \alpha_i + g(x_{it}, \beta) + u_{it}$, then the approaches of
Chapters 21 and 22 will carry through with some modification, including estimation
by nonlinear LS and IV rather than linear LS and IV.

This chapter focuses on models with nonadditive individual-specific effects, such as
in (23.4) and (23.5). These effects can be treated as fixed effects or as random effects.


23.2.2. Fixed Effects Models 


A fixed effects model treats the individual-specific effect $\alpha_i$ as an unobserved random
variable that may be correlated with the regressors $x_{it}$. In short panels joint estimation
of the fixed effects $\alpha_1, \ldots, \alpha_N$ and the other model parameters, $\beta$ and possibly $\gamma$,
generally leads to inconsistent estimation of all parameters. Instead, a variety of methods
have been proposed that eliminate the fixed effects in some special settings, permitting
consistent estimation of the other model parameters.


The Incidental Parameters Problem 


Neyman and Scott (1948) considered inference when some parameters are common 
to all observations but there are additionally an infinity of parameters, each of which 
depends on only a finite number of observations. The common parameters are of 
intrinsic interest, whereas the latter parameters are called incidental parameters. 

Here $\beta$ and $\gamma$ are common parameters, but $\alpha_1, \ldots, \alpha_N$ are incidental parameters if
the panel is short as then each $\alpha_i$ depends on fixed $T$ observations and there are
infinitely many $\alpha_i$ since $N \to \infty$. The incidental parameters are inconsistently estimated
as $N \to \infty$, since only $T$ observations are used to estimate each parameter. The
incidental parameters problem is that this contaminates the estimation of the common
parameters. In general the common parameters are also inconsistently estimated, even
though they are finite in number and are estimated using $NT \to \infty$ observations.

A simple illustration of contamination by incidental parameters is to suppose that
$y_{it} \sim N[\alpha_i, \sigma^2]$. Maximum likelihood estimation yields $\hat{\alpha}_i = \bar{y}_i$, $i = 1, \ldots, N$, and
$\hat{\sigma}^2 = (NT)^{-1}\sum_i \sum_t (y_{it} - \bar{y}_i)^2$. Then $E[\hat{\sigma}^2] = \sigma^2(T - 1)/T$, so $\hat{\sigma}^2$ is inconsistent
for $\sigma^2$ as $N \to \infty$ in the short panel setting of fixed $T$. This inconsistency can be
very large, with $\hat{\sigma}^2 \to_p 0.5\sigma^2$ when $T = 2$.

In general if there is an incidental parameters problem, alternative estimation
methods are needed that first eliminate the incidental parameters. For some popular models,
most notably the panel probit model, there is no solution to the incidental parameters
problem. Even where methods exist to consistently estimate $\beta$ these methods tend
to be model specific, as emphasized by Lancaster (2000). No unified solution to the
incidental parameters problem exists.
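
The normal-means illustration given earlier in this subsection can be reproduced by simulation. The sketch below draws a two-period panel with a separate mean for each individual, estimates each mean by the individual's sample average, and computes the resulting maximum likelihood variance estimate. The sample size, the seed, the distribution used to draw the individual means, and the function name are arbitrary choices for illustration.

```python
import random


def ml_variance_estimate(n_individuals, n_periods, sigma=1.0, seed=12345):
    """ML estimate of the error variance when each individual has its own mean."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_individuals):
        alpha_i = rng.gauss(0.0, 5.0)  # individual-specific mean (arbitrary spread)
        y = [alpha_i + rng.gauss(0.0, sigma) for _ in range(n_periods)]
        y_bar = sum(y) / n_periods     # ML estimate of alpha_i
        total += sum((v - y_bar) ** 2 for v in y)
    return total / (n_individuals * n_periods)


sigma2_hat = ml_variance_estimate(n_individuals=20000, n_periods=2)
print(sigma2_hat)
```

With a true variance of one and two periods, the estimate settles near one half rather than one, however large the number of individuals is made.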


Conditional Likelihood 


A statistic $t$ is called sufficient for a parameter $\theta$ if the distribution of the sample given
$t$ does not depend on $\theta$. For individual-specific effects panel models, if a sufficient
statistic exists for the nuisance parameter $\alpha_i$ then by conditioning on this sufficient
statistic the nuisance parameter is eliminated. The resulting conditional density
depends only on the common parameters, permitting consistent estimation.

Let $y_i = [y_{i1}, \ldots, y_{iT}]'$ be a $T \times 1$ vector dependent variable for individual $i$ over
all $T$ time periods, and let $X_i = [x_{i1}, \ldots, x_{iT}]'$ denote the corresponding $T \times K$
matrix of regressors. For a static model $y_i$ has density

$$f(y_i | X_i, \alpha_i, \beta, \gamma) = \prod_{t=1}^{T} f(y_{it} | x_{it}, \alpha_i, \beta, \gamma).$$ (23.8)


Maximum likelihood estimation based on this density generally leads to inconsistent
estimation of $\beta$ in short panels owing to the incidental parameters problem.

Suppose there exists a sufficient statistic $s_i$ for $\alpha_i$. Then conditioning on the
sufficient statistic $s_i$, in addition to the usual conditioning on regressors, leads to
conditional density


$$ f(\mathbf{y}_i | \mathbf{X}_i, \alpha_i, \beta, \gamma, s_i) = f(\mathbf{y}_i | \mathbf{X}_i, \beta, \gamma, s_i), \tag{23.9} $$


so that $\alpha_i$ has dropped out. For example, for the linear regression model under
normality $s_i = \bar{y}_i$ (see Section 21.6.3). Then the conditional MLE maximizes the
conditional log-likelihood


$$ \ln L_{COND}(\beta, \gamma) = \sum_{i=1}^{N} \ln f(\mathbf{y}_i | \mathbf{X}_i, \beta, \gamma, s_i). \tag{23.10} $$


The adjective conditional is added here to indicate conditioning on $s_i$ and not just $\mathbf{X}_i$.

Andersen (1970) provided a detailed analysis of the conditional MLE. He showed
that the conditional MLE is consistent if the density
$f(\mathbf{y}_i | \mathbf{X}_i, \alpha_i, \beta)$ is correctly specified and that the
information matrix equality holds for the conditional log-likelihood, but that in general
there is a loss of efficiency as the conditional MLE need not attain the Cramér-Rao lower
bound. For normal and Poisson distributions, however, there is no efficiency loss.

The approach requires that a suitable sufficient statistic exists. This is the case for 
only a few models, essentially those of the linear exponential family. Andersen focused 
on models without regressors and gave as examples the normal, Poisson, binomial, 
and gamma. Once regressors are introduced it becomes even more difficult to find 
a suitable sufficient statistic. McCullagh and Nelder (1989) provide a quite general
discussion and Diggle et al. (2002) restrict their attention to specialized GLMs with 
canonical link functions. 

The leading examples when a sufficient statistic is available are linear models un- 
der normality (see Section 21.6.2), logit models (though not probit models) for binary 
data (see Section 23.4.3), one-parameter gamma (including exponential), and particu- 
lar parameterizations of the Poisson and negative binomial models for count data (see 
Section 23.7.3). 


Mean-Differenced Transformation 


For some models of the conditional mean with additive or multiplicative effects, the
individual effects $\alpha_i$ can instead be eliminated by use of an appropriate differencing
transformation. This leads to moment conditions that can be used for method of
moments or GMM estimation as detailed in Section 23.2.6.

The mean-differenced transformation generalizes the within transformation for
the linear model given in Section 21.2.2 that eliminates $\alpha_i$ by subtracting
individual-specific means. It requires strongly exogenous regressors (see (23.7)).

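For the additive effects model defined in (23.3) with strongly exogenous regressors


$$ E[(y_{it} - \bar{y}_i) - (g(\mathbf{x}_{it}, \beta) - \bar{g}_i(\beta)) | \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}] = 0, \tag{23.11} $$


where $\bar{g}_i(\beta) = T^{-1} \sum_t g(\mathbf{x}_{it}, \beta)$ and the result uses
$E[\bar{y}_i | \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}] = \alpha_i + \bar{g}_i(\beta)$.
For linear models (23.11) simplifies considerably as then
$g(\mathbf{x}_{it}, \beta) - \bar{g}_i(\beta) = (\mathbf{x}_{it} - \bar{\mathbf{x}}_i)'\beta$.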


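For the multiplicative effects model defined in (23.4), some algebra leads to


$$ E\left[ y_{it} - \frac{g(\mathbf{x}_{it}, \beta)}{\bar{g}_i(\beta)} \times \bar{y}_i \,\Big|\, \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT} \right] = 0, \tag{23.12} $$


using $E[\bar{y}_i | \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}] = \alpha_i \bar{g}_i(\beta)$. For
simplicity we call this a mean-differenced transformation, though strictly speaking it is
a quasi-difference. It is also called a (conditional) mean-scaling transformation, as
equivalently


$$ E\left[ y_{it} - \frac{\bar{y}_i}{\bar{g}_i(\beta)} \, g(\mathbf{x}_{it}, \beta) \,\Big|\, \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT} \right] = 0. $$

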
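
The algebra behind the quasi-difference for the multiplicative effects model is short. The following sketch assumes the multiplicative form $E[y_{it} | \alpha_i, \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}] = \alpha_i g(\mathbf{x}_{it}, \beta)$ with strongly exogenous regressors; the shorthand $g_{it}$ and $\bar{g}_i$ is introduced here only for compactness.

```latex
\begin{aligned}
E[\bar{y}_i \mid \alpha_i, \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}]
  &= T^{-1} \sum_{t} \alpha_i g_{it} = \alpha_i \bar{g}_i, \\
E\!\left[ y_{it} - \frac{g_{it}}{\bar{g}_i}\, \bar{y}_i \,\Big|\, \alpha_i, \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT} \right]
  &= \alpha_i g_{it} - \frac{g_{it}}{\bar{g}_i}\, \alpha_i \bar{g}_i = 0 .
\end{aligned}
```

Because the conditional expectation is zero for every value of $\alpha_i$, the law of iterated expectations gives the same result after conditioning on the regressors alone, whatever the dependence between $\alpha_i$ and the regressors.

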
First-Differences Transformation 


The first-differences transformation generalizes the first-difference transformation
for the linear model given in Section 21.2.2 that eliminates $\alpha_i$ by subtracting the model
lagged one period. We assume regressors are weakly exogenous (see (23.6)).

For the additive effects model,


$$ E[(y_{it} - y_{i,t-1}) - (g(\mathbf{x}_{it}, \beta) - g(\mathbf{x}_{i,t-1}, \beta)) | \mathbf{x}_{i1}, \ldots, \mathbf{x}_{i,t-1}] = 0, \tag{23.13} $$

where we have used $E[y_{i,t-1} | \mathbf{x}_{i1}, \ldots, \mathbf{x}_{i,t-1}] = \alpha_i + g(\mathbf{x}_{i,t-1}, \beta)$.

For the multiplicative effects model defined in (23.4),


$$ E\left[ y_{it} - \frac{g(\mathbf{x}_{it}, \beta)}{g(\mathbf{x}_{i,t-1}, \beta)} \times y_{i,t-1} \,\Big|\, \mathbf{x}_{i1}, \ldots, \mathbf{x}_{i,t-1} \right] = 0, \tag{23.14} $$

where we have used $E[y_{i,t-1} | \mathbf{x}_{i1}, \ldots, \mathbf{x}_{i,t-1}] = \alpha_i g(\mathbf{x}_{i,t-1}, \beta)$.
For simplicity we call it a first-differences transformation, though strictly speaking it
is a quasi-difference.

The first-differences transformation relies on weaker assumptions, conditioning
only up to period $t$. It permits estimation of dynamic models, extending Section 22.5
to nonlinear models. For dynamic multiplicative effects Wooldridge (1997) and
Chamberlain (1992) actually proposed use of a variant of (23.14),


$$ E\left[ \frac{g(\mathbf{x}_{i,t-1}, \beta)}{g(\mathbf{x}_{it}, \beta)} \, y_{it} - y_{i,t-1} \,\Big|\, \mathbf{x}_{i1}, \ldots, \mathbf{x}_{i,t-1} \right] = 0. \tag{23.15} $$


Dummy Variable Model Estimation 


If the incidental parameters problem is ignored, one can attempt to estimate all
parameters, including the individual-specific effects. Introduce a set of $N$ dummy
variables $d_{j,it}$ equal to 1 if $i = j$ and equal to 0 otherwise, and then jointly estimate
the individual-specific parameters $\alpha_1, \ldots, \alpha_N$ along with the other model parameters.

This estimator is computationally feasible, despite the very large number of
parameters owing to large $N$, but the resulting estimates of $\beta$ and possibly $\gamma$ are
in general inconsistent. Here we consider just parametric models, though similar points
hold for conditional mean models.

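Thus consider the parametric individual-specific effects model defined in (23.1).
Then the method is to obtain ML estimates of $\beta$, $\gamma$, and
$\alpha = [\alpha_1 \ldots \alpha_N]'$ that maximize the full log-likelihood


$$ \ln L_{FE}(\beta, \gamma, \alpha) = \sum_{i=1}^{N} \sum_{t=1}^{T} \ln f(y_{it}, \mathbf{d}_{it}'\alpha + \mathbf{x}_{it}'\beta, \gamma), \tag{23.16} $$


where $\mathbf{d}_{it} = [d_{1,it} \ldots d_{N,it}]'$. The first-order conditions with respect to
$\delta = [\beta' \; \gamma']'$ and $\alpha$ are


$$ \sum_{i=1}^{N} \sum_{t=1}^{T} \partial \ln f(y_{it}, \mathbf{d}_{it}'\alpha + \mathbf{x}_{it}'\beta, \gamma) / \partial \delta = \mathbf{0}, $$

$$ \sum_{t=1}^{T} \partial \ln f(y_{it}, \alpha_i + \mathbf{x}_{it}'\beta, \gamma) / \partial \alpha_i = 0, \quad i = 1, \ldots, N. $$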


This estimator can be simple to compute despite the large number of parameters, $N$
plus the dimension of $\delta$. As detailed in Greene (2004b), the inverse of the Hessian
is easily obtained by partitioning into $\delta$ and $\alpha$ and applying the standard partitioned
inverse formula, using the simplification that
$\partial^2 \ln L(\delta, \alpha) / \partial \alpha_i \partial \alpha_j = 0$ for $j \neq i$, so
that the inverse of the large $N \times N$ block corresponding to $(\alpha, \alpha)$ is trivially obtained.
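
The computational point can be illustrated numerically. The sketch below is not Greene's algorithm itself, only the partitioned inverse (Schur complement) step it relies on; the function name and the randomly generated blocks are hypothetical, and a positive definite matrix stands in for minus the Hessian.

```python
import numpy as np

def delta_block_of_inverse(H_dd, H_da, h_aa_diag):
    """Upper-left block of the inverse of [[H_dd, H_da], [H_da', diag(h_aa_diag)]].

    H_dd: (q, q) block for delta, H_da: (q, N) cross block,
    h_aa_diag: (N,) diagonal of the alpha block.
    """
    # The alpha block is diagonal, so its inverse is an elementwise reciprocal
    # and no N x N matrix is ever formed or inverted.
    schur = H_dd - (H_da / h_aa_diag) @ H_da.T
    return np.linalg.inv(schur)

# Illustrative check against brute-force inversion of the full (q + N) matrix.
rng = np.random.default_rng(0)
q, N = 3, 500
H_da = rng.normal(size=(q, N))
h_aa = rng.uniform(1.0, 2.0, size=N)
H_dd = (H_da / h_aa) @ H_da.T + np.eye(q)      # keeps the full matrix positive definite
full = np.block([[H_dd, H_da], [H_da.T, np.diag(h_aa)]])
brute = np.linalg.inv(full)[:q, :q]
fast = delta_block_of_inverse(H_dd, H_da, h_aa)
print(np.allclose(brute, fast))
```

The work in the fast version grows linearly in $N$, whereas brute-force inversion grows with the cube of $N$.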

In two special cases there is no incidental parameters problem. First, if
$y_{it} \sim \mathcal{N}[\alpha_i + \mathbf{x}_{it}'\beta, \sigma^2]$ then, from Section 21.6.4,
the MLE for $\beta$ is the within estimator, which is consistent for $\beta$ even for finite $T$.
Here the incidental parameters problem arises for estimation of $\sigma^2$ but not for $\beta$.
Second, for $y_{it} \sim \mathcal{P}[\exp(\alpha_i + \mathbf{x}_{it}'\beta)]$ there is
similarly no incidental parameters problem in estimating $\beta$ (see Section 23.7.3).

In general, however, there is an incidental parameters problem. The derivative with
respect to $\alpha_i$ involves only $T$ observations, rather than all $NT$ observations. This
usually spills over to inconsistency of $\hat{\beta}_{ML}$ and $\hat{\gamma}_{ML}$ in short
panels. It is possible that this inconsistency is moderate in panels that are not too
short, such as $T = 10$ or $T = 20$. The simulation study of Greene (2004a) indicates that
the nature and extent of bias vary considerably with the particular nonlinear model being
studied. The development of methods that are reasonably robust in the presence of fixed
effects, though still inconsistent in short panels, is an active area of research.


23.2.3. Random Effects Models 


A random effects model treats the individual-specific effect $\alpha_i$ as a random variable
with specified distribution and eliminates $\alpha_i$ by integrating over this distribution.
Random effects are usually applied to parametric models.


Parametric Models 


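Suppose the $i$th observation $\mathbf{y}_i$ has joint density
$f(\mathbf{y}_i | \mathbf{X}_i, \alpha_i, \beta, \gamma)$, conditional on $\alpha_i$, given
in (23.8), and the random effect has density


$$ \alpha_i \sim g(\alpha_i | \eta), \tag{23.17} $$


where $g(\alpha_i | \eta)$ does not depend on observables. Then the unconditional joint density
for the $i$th observation is


$$ f(\mathbf{y}_i | \mathbf{X}_i, \beta, \gamma, \eta) = \int \left[ \prod_{t=1}^{T} f(y_{it} | \mathbf{x}_{it}, \alpha_i, \beta, \gamma) \right] g(\alpha_i | \eta) \, d\alpha_i, \tag{23.18} $$

where by unconditional we mean we no longer condition on $\alpha_i$. The random effects
MLE of $\beta$, $\gamma$, and $\eta$ maximizes the log-likelihood


$$ \ln L_{RE}(\beta, \gamma, \eta) = \sum_{i=1}^{N} \ln \left( \int \left[ \prod_{t=1}^{T} f(y_{it} | \mathbf{x}_{it}, \alpha_i, \beta, \gamma) \right] g(\alpha_i | \eta) \, d\alpha_i \right). \tag{23.19} $$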


In some cases an analytical expression for this integral is possible, basically if
$\prod_t f(y_{it} | \alpha_i)$ and $g(\alpha_i)$ are conjugate pairs (see Table 13.2). Examples
include normal-normal for linear regression, which yields normal, and Poisson-gamma for
count data regression, which yields negative binomial.

In most cases analytical results are not available, but numerical methods or
simulation-based methods are likely to work well because the integral is only one
dimensional. The usual approach is to choose $f(y_{it})$ to be the density that is thought
to best fit the data in the absence of individual effects, and to let $g(\alpha_i)$ be the
normal density. The integral is then a univariate integral with respect to a normal random
variable. For small $T$ the integral can be well approximated by Gauss-Hermite quadrature
(see Section 12.3.1), which approximates the integral with respect to a normal density
by a weighted sum. Butler and Moffitt (1982) provide a detailed exposition for
the random effects probit model. Skrondal and Rabe-Hesketh (2004) use quadrature.
Alternatively, repeated draws from $g(\alpha_i)$ can be the basis for maximum simulated
likelihood estimation (see Section 12.4.2).
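
As an illustration of the quadrature approach, the following sketch evaluates the likelihood contribution (23.18) of one individual in a random effects probit with a normally distributed effect. It is a minimal example rather than the Butler and Moffitt implementation; the function names, the number of nodes, and the numerical inputs are assumptions made for illustration.

```python
import numpy as np
from math import erf, sqrt, pi

def norm_cdf(z):
    """Standard normal cdf, vectorised using math.erf."""
    return 0.5 * (1.0 + np.vectorize(erf)(np.asarray(z, dtype=float) / sqrt(2.0)))

def re_probit_lik_i(y_i, xb_i, sigma_alpha, n_nodes=20):
    """Likelihood contribution for one individual, alpha_i ~ N[0, sigma_alpha^2].

    y_i: (T,) array of 0/1 outcomes; xb_i: (T,) array of x_it' beta.
    """
    # Gauss-Hermite rule: integral of exp(-x^2) h(x) dx ~ sum_k w_k h(x_k).
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    alpha = sqrt(2.0) * sigma_alpha * nodes           # change of variable to N[0, sigma^2]
    p = norm_cdf(alpha[:, None] + xb_i[None, :])      # (n_nodes, T) probabilities of y = 1
    cond_lik = np.prod(np.where(y_i[None, :] == 1, p, 1.0 - p), axis=1)
    return float(weights @ cond_lik / sqrt(pi))

lik = re_probit_lik_i(np.array([1, 0, 1]), np.array([0.3, -0.2, 0.5]), sigma_alpha=1.0)
print(lik)
```

Summing the log of this quantity over individuals gives the log-likelihood (23.19), which can then be passed to a numerical optimizer.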

The preceding discussion assumed independence over $t$ for given $i$. If instead $y_{it}$
and $y_{is}$ are correlated over $t$ then it is more efficient to replace
$\prod_t f(y_{it} | \mathbf{x}_{it}, \alpha_i, \beta, \gamma)$ by
$f(\mathbf{y}_i | \mathbf{X}_i, \alpha_i, \beta, \gamma)$ in (23.18) and (23.19).


Random Coefficients Model 


The random effects approach can clearly be generalized to a random coefficients 
model, with random slopes as well as random intercepts, similar to the linear case in 
Section 22.8. 

The natural model is a single-index model with conditional density
$f(y_{it}, \mathbf{x}_{it}'(\beta + \alpha_i), \gamma)$ or conditional mean
$g(\mathbf{x}_{it}'(\beta + \alpha_i))$, and the univariate integral with respect to scalar
$\alpha_i$ will become a multivariate integral with respect to vector $\alpha_i$, usually
assumed to be normally distributed.


Correlated Random Effects Model 


The key weakness of the random effects model is that it makes the strong assumption
that the random effects are independent of regressors. To overcome this limitation
Chamberlain (1980, 1982) proposed a correlated random effects model (for background
discussion see Section 21.4.4) that specifies


$$ \alpha_i = \mathbf{x}_{i1}'\pi_1 + \cdots + \mathbf{x}_{iT}'\pi_T + \varepsilon_i. \tag{23.20} $$


The likelihood above is then maximized with respect to $\beta$, $\gamma$, $\pi$, and the
parameters of the density of $\varepsilon$. Unlike the case of linear models, this model
leads to a different estimator from that obtained using the simpler specification of
Mundlak (1978) that


$$ \alpha_i = \bar{\mathbf{x}}_i'\pi + \varepsilon_i. \tag{23.21} $$


The equation (23.20) can be viewed as an example of a hierarchical model. More 
general hierarchical models also permit random slopes, with estimation by classical or 
Bayesian methods. Section 22.8 presented details for the linear model. 


Finite Mixture Model 


The finite mixture model (see Section 18.5.1) provides an alternative model for the
unobserved individual-specific effect. If there are $m$ latent classes or types of
individuals and for the $j$th type $\alpha_i = \alpha_j$ then (23.18) becomes


$$ f(\mathbf{y}_i | \mathbf{X}_i, \beta, \gamma, \pi) = \sum_{j=1}^{m} \left[ \prod_{t=1}^{T} f(y_{it} | \mathbf{x}_{it}, \alpha_j, \beta, \gamma) \right] \pi_j. $$

This model is most often used for panel duration models (see Section 18.5.2).


23.2.4. Pooled Models


The pooled model does not explicitly model individual-specific effects. It extends lin- 
ear pooled regression (see Section 21.5) to nonlinear models. 


Conditional Mean Models 


For conditional mean models the pooled model is


$$ E[y_{it} | \mathbf{x}_{it}] = g(\mathbf{x}_{it}, \beta), \tag{23.22} $$


for specified function $g(\cdot)$.

The model (23.22) can be directly estimated by NLS, with inference based on
panel-robust standard errors that control for conditional heteroskedasticity and for
conditional correlation between $y_{it}$ and $y_{is}$. More efficient estimation is possible
by modeling the heteroskedasticity and correlation. Details are provided in Section 23.2.6.


Pooled versus Random Effects Models 


What is the cost of ignoring individual-specific random effects? 

The additive effect model $E[y_{it} | \alpha_i, \mathbf{x}_{it}] = \alpha_i + g(\mathbf{x}_{it}, \beta)$
leads to (23.22) if $E[\alpha_i | \mathbf{x}_{it}] = 0$. The multiplicative effect model
$E[y_{it} | \alpha_i, \mathbf{x}_{it}] = \alpha_i g(\mathbf{x}_{it}, \beta)$ implies
(23.22) if $E[\alpha_i | \mathbf{x}_{it}] = 1$. So the pooled model will lead to consistent
estimation of $\beta$ in a random effects model if the effects are additive or multiplicative
and the standard normalizations of the mean of $\alpha_i$ for these models are used.

Otherwise, the pooled model is unlikely to lead to the same parameter estimates as
an individual-specific random effects model. For example, consider a probit random
effects model with $E[y_{it} | \alpha_i, \mathbf{x}_{it}] = \Phi(\alpha_i + \mathbf{x}_{it}'\beta)$,
where $\alpha_i \sim \mathcal{N}[0, \sigma_\alpha^2]$. Then it can be shown that
$E[y_{it} | \mathbf{x}_{it}] = \Phi(\mathbf{x}_{it}'\beta / \sqrt{1 + \sigma_\alpha^2})$, which
differs from the natural pooled probit model
$E[y_{it} | \mathbf{x}_{it}] = \Phi(\mathbf{x}_{it}'\beta)$. Unlike the linear model of
Chapter 21, if the true model has individual-specific random effects then ignoring these
effects can lead to inconsistent parameter estimates of $\beta$.
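
The probit result follows from a short latent-variable argument. The sketch below assumes the usual latent-variable representation of the probit model, with an error $\varepsilon_{it}$ that is standard normal and independent of $\alpha_i$ and $\mathbf{x}_{it}$; the symbols $y_{it}^*$ and $\varepsilon_{it}$ are introduced here for illustration.

```latex
\begin{aligned}
y_{it}^* &= \alpha_i + \mathbf{x}_{it}'\beta + \varepsilon_{it}, \qquad y_{it} = 1[\,y_{it}^* > 0\,], \\
\alpha_i + \varepsilon_{it} \mid \mathbf{x}_{it} &\sim \mathcal{N}[\,0,\; 1 + \sigma_\alpha^2\,], \\
E[y_{it} \mid \mathbf{x}_{it}]
  &= \Pr[\,\alpha_i + \varepsilon_{it} > -\mathbf{x}_{it}'\beta \mid \mathbf{x}_{it}\,]
   = \Phi\!\left( \frac{\mathbf{x}_{it}'\beta}{\sqrt{1 + \sigma_\alpha^2}} \right).
\end{aligned}
```

Under these assumptions a pooled probit therefore estimates $\beta / \sqrt{1 + \sigma_\alpha^2}$ rather than $\beta$, so all slope coefficients are attenuated by the same factor.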

The statistics literature uses the pooled model approach extensively for panel
versions of generalized linear models, such as binary data and count data. The
resulting parameter estimates are called population averaged, as the random effects are
averaged out. The approach is called marginal analysis, as $E[y_{it} | \mathbf{x}_{it}]$ is a
model that is marginal with respect to the random effects.


Parametric Models


For pooled parametric models the starting point is usually


$$ f(y_{it} | \mathbf{x}_{it}) = f(y_{it}, \mathbf{x}_{it}'\beta, \gamma) \tag{23.23} $$


for specified density $f(\cdot)$. This model can be directly estimated by ML, with inference
based on panel-robust standard errors that control for conditional heteroskedasticity
and correlation (see Section 23.2.6).

In general the pooled parametric model estimates of $\beta$ and $\gamma$ are unlikely to
coincide with those from a random effects parametric model. The arguments are
similar to those for the conditional mean.


23.2.5. Fixed versus Random Effects 


The essential result that random effects and pooled model estimators are inconsistent 
if individual-specific effects are present and are correlated with regressors still holds 
in nonlinear models. This favors use of fixed effects models on grounds of robustness, 
though there is a trade-off with loss of efficiency in estimation. A Hausman test can 
be used (see Section 21.4.4) to test whether a fixed effects model is needed, provided 
consistent estimation of the fixed effects model is possible. 

Other comparisons of fixed versus random effects models for linear models (see 
Section 21.4) require some adaptation for nonlinear models. 

Because of the incidental parameters problem, not all nonlinear models with fixed 
effects admit consistent parameter estimates. So fixed effects modeling is not always 
feasible. 

If consistent estimation of a nonlinear fixed effects model is possible then, unlike
the linear case, the coefficients of time-invariant regressors can be identified. To see
this consider the mean-differenced transformation for an additive effects model. For
a linear model
$E[(y_{it} - \bar{y}_i) - (\mathbf{x}_{it} - \bar{\mathbf{x}}_i)'\beta | \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}] = 0$,
with obvious problems for time-invariant regressors as then, considering the $j$th regressor,
$x_{itj} - \bar{x}_{ij} = x_{ij} - x_{ij} = 0$. More generally, from (23.11)


$$ E[(y_{it} - \bar{y}_i) - (g(\mathbf{x}_{it}, \beta) - \bar{g}_i(\beta)) | \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}] = 0, $$


with no such simplification for nonlinear $g(\cdot)$ unless all $K$ components of
$\mathbf{x}_{it}$ are time-invariant.

In fixed effect models with nonadditive effects it is not possible to predict changes
in the dependent variable as regressors change. For the general model (23.2), the
marginal effect
$\partial E[y_{it} | \mathbf{x}_{it}, \alpha_i, \beta] / \partial \mathbf{x}_{it} = \partial g(\mathbf{x}_{it}, \alpha_i, \beta) / \partial \mathbf{x}_{it}$
depends on $\alpha_i$.

The marginal effect can be measured in two special cases. For additive effects
(see (23.3)) the marginal effect is $\partial g(\mathbf{x}_{it}, \beta) / \partial \mathbf{x}_{it}$,
which does not depend on $\alpha_i$. For multiplicative effects models (see (23.4)) the
marginal effect is $\alpha_i \, \partial g(\mathbf{x}_{it}, \beta) / \partial \mathbf{x}_{it}$.
Then it is possible to measure the relative size of marginal effects for
changes in different regressors. In particular, if
$E[y_{it} | \mathbf{x}_{it}, \alpha_i, \beta] = \alpha_i \exp(\mathbf{x}_{it}'\beta)$, then
$(\partial E[y_{it}] / \partial x_{itj}) / (\partial E[y_{it}] / \partial x_{itk}) = \beta_j / \beta_k$.


23.2.6. Estimation and Panel-Robust Statistical Inference 


The preceding analysis has concentrated on elimination of the incidental parameter
$\alpha_i$. Now we detail estimation of model parameters, once $\alpha_i$ has been eliminated for
models with individual-specific effects.

We assume a short panel and independence of observations over $i$. The dependent
variable $y_{it}$ may be conditionally heteroskedastic and conditionally correlated over $t$
for given $i$. The situation is similar to that in Section 21.2.3, except that nonlinear
estimators are used instead of simpler linear LS estimators. Standard statistical output
that ignores this complication will lead to invalid inference. In the following we present
expressions for panel-robust estimates of the variance matrix of parameter estimates.
Alternatively, a panel bootstrap can be used (see Section 11.6.2).


GMM Estimation 


Panel GMM estimation is appropriate for models based on the conditional mean. The
key is specification of the moment condition that is the basis of GMM estimation.
Following Section 22.2.1, a natural starting point is


$$ E[\mathbf{Z}_i'\mathbf{u}_i(\theta)] = \mathbf{0}, \quad i = 1, \ldots, N, \tag{23.24} $$


where $\mathbf{Z}_i$ is a $T \times r$ matrix that depends on the regressors,
$\mathbf{u}_i(\theta)$ is a $T \times 1$ residual vector, and $\theta$ is a $q \times 1$
parameter vector. Different panel models lead to different specifications of
$\mathbf{u}_i$ and $\mathbf{Z}_i$. An example is given in the following. A key departure
from Chapter 22 is that the residual $\mathbf{u}_i(\theta)$ will be nonlinear in $\theta$.

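If $r = q$ then there are as many moment conditions as parameters to estimate and
we can use the panel method of moments estimator $\hat{\theta}_{MM}$ that solves


$$ \sum_{i=1}^{N} \mathbf{Z}_i'\mathbf{u}_i(\hat{\theta}) = \mathbf{0}. \tag{23.25} $$


Using results in Section 6.10.3 on nonlinear systems estimation, we have that this
estimator is asymptotically normal with variance matrix consistently estimated by


$$ \hat{V}[\hat{\theta}_{MM}] = \left[ \sum_{i=1}^{N} \mathbf{Z}_i'\hat{\mathbf{D}}_i \right]^{-1} \sum_{i=1}^{N} \mathbf{Z}_i'\hat{\mathbf{u}}_i \hat{\mathbf{u}}_i'\mathbf{Z}_i \left[ \sum_{i=1}^{N} \hat{\mathbf{D}}_i'\mathbf{Z}_i \right]^{-1}, \tag{23.26} $$


where $\hat{\mathbf{D}}_i = \partial \mathbf{u}_i / \partial \theta' |_{\hat{\theta}}$ and
$\hat{\mathbf{u}}_i = \mathbf{u}_i(\hat{\theta})$. This yields panel-robust standard errors in
short panels.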
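
The sandwich form (23.26) is straightforward to compute once the instruments, residuals, and residual Jacobians are stacked by individual. The sketch below is illustrative only: the function name and array layout are assumptions, and the usage example takes the linear special case $\mathbf{u}_i = \mathbf{y}_i - \mathbf{X}_i\theta$ with $\mathbf{Z}_i = \mathbf{X}_i$, for which the formula reduces to the familiar cluster-robust OLS variance.

```python
import numpy as np

def panel_robust_variance(Z, D, u):
    """Panel-robust variance estimate of the form (23.26).

    Z: (N, T, q) instruments, D: (N, T, q) Jacobians du_i/dtheta',
    u: (N, T) residuals, all evaluated at the estimate (just-identified case).
    """
    A = np.einsum("itr,itq->rq", Z, D)     # sum_i Z_i' D_i
    s = np.einsum("itr,it->ir", Z, u)      # row i holds (Z_i' u_i)'
    B = s.T @ s                            # sum_i Z_i' u_i u_i' Z_i
    A_inv = np.linalg.inv(A)
    return A_inv @ B @ A_inv.T

# Illustrative linear example with an individual effect that induces correlation over t.
rng = np.random.default_rng(1)
N, T, q = 200, 4, 2
X = rng.normal(size=(N, T, q))
alpha = rng.normal(size=(N, 1))
y = X @ np.array([1.0, -0.5]) + alpha + rng.normal(size=(N, T))
Xs, ys = X.reshape(-1, q), y.reshape(-1)
theta_hat = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
u_hat = y - X @ theta_hat
V = panel_robust_variance(X, -X, u_hat)    # D_i = -X_i for a linear residual
print(np.sqrt(np.diag(V)))
```

For a nonlinear residual only the arrays `D` and `u` change; the function itself is unchanged.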

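If $r > q$ then GMM estimation is necessary, and we use the panel GMM estimator
$\hat{\theta}_{GMM}$ that minimizes


$$ Q_N(\theta) = \left[ \sum_{i=1}^{N} \mathbf{Z}_i'\mathbf{u}_i(\theta) \right]' \mathbf{W}_N \left[ \sum_{i=1}^{N} \mathbf{Z}_i'\mathbf{u}_i(\theta) \right], \tag{23.27} $$


where $\mathbf{W}_N$ is an $r \times r$ weighting matrix. The asymptotic variance matrix for
this estimator can be obtained directly from results for the nonlinear systems IV estimator
given in Section 6.10.4. Given the moment condition (23.24), the most efficient
estimator uses
$\mathbf{W}_N = [N^{-1} \sum_i \mathbf{Z}_i'\hat{\mathbf{u}}_i \hat{\mathbf{u}}_i'\mathbf{Z}_i]^{-1}$.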

More efficient estimators are possible using alternative moment conditions. In par- 
ticular, if the starting point is a particular conditional moment condition then the op- 
timal unconditional moment condition for GMM estimation is given in Section 6.3.7. 
The GEE estimator given later follows this approach. A more general treatment is 
given in Avery, Hansen, and Hotz (1983) and Breitung and Lechner (1999). 


GMM Example


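As a specific example, consider the first-differences transformation applied to the
multiplicative fixed effects model. The starting point is the conditional moment restriction
(23.14). This leads to many unconditional moment conditions, one of which is


$$ E\left[ \mathbf{x}_{it} \left( y_{it} - \frac{g(\mathbf{x}_{it}, \beta)}{g(\mathbf{x}_{i,t-1}, \beta)} \times y_{i,t-1} \right) \right] = \mathbf{0}. $$


Assume data on $(y_{it}, \mathbf{x}_{it})$ are available for $(T + 1)$ periods, with the initial
period then lost because of first differencing. Stacking over $T$ time periods
yields (23.24) with $\mathbf{Z}_i = [\mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}]'$ and
$\mathbf{u}_i = [u_{i1}, \ldots, u_{iT}]'$, where
$u_{it} = y_{it} - [g(\mathbf{x}_{it}, \beta) / g(\mathbf{x}_{i,t-1}, \beta)] \, y_{i,t-1}$.
Here $\mathbf{Z}_i'\mathbf{u}_i = \sum_t \mathbf{x}_{it} u_{it}$, so the method of moments
estimator $\hat{\beta}$ solves


$$ \sum_{i=1}^{N} \sum_{t=1}^{T} \mathbf{x}_{it} \left( y_{it} - \frac{g(\mathbf{x}_{it}, \hat{\beta})}{g(\mathbf{x}_{i,t-1}, \hat{\beta})} \, y_{i,t-1} \right) = \mathbf{0}. $$


Clearly, additional moment conditions can be used, such as
$E[\mathbf{x}_{i,t-1} u_{it}] = \mathbf{0}$, leading to an overidentified model and
estimation by GMM. This was discussed extensively for the linear model in Section 22.2.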


Generalized Estimating Equations Estimation 


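The pooled model for the conditional mean specifies
$E[y_{it} | \mathbf{x}_{it}] = g(\mathbf{x}_{it}, \beta)$ (see Section 23.2.4). This model can
be estimated by GMM methods already given. Here we go further and consider efficient GMM
estimation.

Stacking over all $T$ observations gives conditional moment condition


$$ E[\mathbf{y}_i - \mathbf{g}_i(\beta) | \mathbf{X}_i] = \mathbf{0}, \tag{23.28} $$


where $\mathbf{g}_i(\beta) = [g(\mathbf{x}_{i1}, \beta), \ldots, g(\mathbf{x}_{iT}, \beta)]'$ and
$\mathbf{X}_i = [\mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}]'$. The optimal unconditional
moment condition to use in estimation is then


$$ E\left[ \frac{\partial \mathbf{g}_i'(\beta)}{\partial \beta} \{ V[\mathbf{y}_i | \mathbf{X}_i] \}^{-1} (\mathbf{y}_i - \mathbf{g}_i(\beta)) \right] = \mathbf{0}, \tag{23.29} $$


a result obtained by applying the general result given in Section 6.3.7. This leads to
the generalized estimating equations estimator $\hat{\beta}_{GEE}$ that solves


$$ \sum_{i=1}^{N} \frac{\partial \mathbf{g}_i'(\beta)}{\partial \beta} \Sigma_i^{-1} (\mathbf{y}_i - \mathbf{g}_i(\beta)) = \mathbf{0}, \tag{23.30} $$


where $\Sigma_i$ is a working variance matrix for $V[\mathbf{y}_i | \mathbf{X}_i]$. The
asymptotic variance matrix of $\hat{\beta}_{GEE}$ is given by (23.26) with
$\mathbf{u}_i = \mathbf{y}_i - \mathbf{g}_i(\beta)$ and
$\mathbf{Z}_i' = \partial \mathbf{g}_i'(\beta) / \partial \beta |_{\hat{\beta}} \times \hat{\Sigma}_i^{-1}$.
This variance estimate is panel-robust and is also robust to misspecification of $\Sigma_i$.

The GEE estimator, due to Liang and Zeger (1986), is widely used in the statistics
literature for panel versions of generalized linear models. Different GLMs correspond
to different conditional mean functions $\mathbf{g}_i(\beta)$ and working variance
matrices $\Sigma_i$.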


ML Estimation


For likelihood-based models the starting point is the joint density of all $T$
observations for individual $i$, $f(\mathbf{y}_i | \mathbf{X}_i, \theta)$. For pooled parametric
models $\theta' = [\beta', \gamma']$ (see (23.23)), and for random effects parametric models
$\theta' = [\beta', \gamma', \eta']$ (see (23.18)).

The standard approach is to let
$f(\mathbf{y}_i | \mathbf{X}_i, \theta) = \prod_{t=1}^{T} f(y_{it} | \mathbf{x}_{it}, \theta)$, where
$f(y_{it} | \mathbf{x}_{it}, \theta)$ is the density for the $(i, t)$th observation. The implicit
assumption of independence over $t$ for given $i$ is usually unwarranted, especially for
pooled models that do not include a random effect that permits some correlation over time.
Nonetheless, consistent estimates of $\theta$ are obtained even if
$f(\mathbf{y}_i | \mathbf{X}_i, \theta)$ is misspecified, provided
$f(y_{it} | \mathbf{x}_{it}, \theta)$ is correctly specified. A sandwich form should then be used
for the estimator variance matrix to ensure panel-robust standard errors. The MLE is
strictly a quasi-MLE, with detailed discussion given in Section 5.7.5. More generally, this
approach is an example of inference with clustered data (see Section 24.5).
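
The gap between default and panel-robust standard errors for a pooled quasi-MLE can be seen in a small simulation. The following sketch fits a pooled Poisson model by Newton-Raphson and reports both sets of standard errors; the function name, the data-generating process (a gamma multiplicative effect with mean one), and all numerical settings are assumptions chosen for illustration.

```python
import numpy as np

def pooled_poisson(y, X, n_iter=50):
    """Pooled Poisson quasi-MLE with default and panel-robust standard errors.

    y: (N, T) counts; X: (N, T, k) regressors. Conditional mean is exp(x_it' beta).
    """
    N, T, k = X.shape
    Xs, ys = X.reshape(-1, k), y.reshape(-1)
    beta = np.zeros(k)
    for _ in range(n_iter):                          # Newton-Raphson on the pooled log-likelihood
        mu = np.exp(Xs @ beta)
        grad = Xs.T @ (ys - mu)
        neg_hess = (Xs * mu[:, None]).T @ Xs
        beta = beta + np.linalg.solve(neg_hess, grad)
    mu = np.exp(X @ beta)                            # (N, T) fitted means
    A = np.einsum("itk,it,itl->kl", X, mu, X)        # minus the Hessian at the estimate
    s = np.einsum("itk,it->ik", X, y - mu)           # score summed over t for each i
    A_inv = np.linalg.inv(A)
    se_default = np.sqrt(np.diag(A_inv))             # assumes independence and equidispersion
    se_panel = np.sqrt(np.diag(A_inv @ (s.T @ s) @ A_inv))   # sandwich, clustered on i
    return beta, se_default, se_panel

rng = np.random.default_rng(42)
N, T = 500, 5
x = rng.normal(size=(N, T))
X = np.stack([np.ones((N, T)), x], axis=2)             # constant and one regressor
alpha = rng.gamma(shape=2.0, scale=0.5, size=(N, 1))   # multiplicative effect with mean one
y = rng.poisson(alpha * np.exp(0.5 + 0.5 * x))
beta_hat, se_default, se_panel = pooled_poisson(y, X)
print(beta_hat, se_default, se_panel)
```

In this design the pooled estimator remains consistent because the effect is independent of the regressor, but the default standard errors understate the sampling variability.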

More efficient estimation is possible using a richer model for
$f(\mathbf{y}_i | \mathbf{X}_i, \theta)$ that accommodates correlation over time. However,
nonnormal multivariate distributions for $\mathbf{y}_i$ can be restrictive or difficult to
work with. For pooled GLMs the GEE estimator can be used instead.


23.2.7. Dynamic Models 


Dynamic models with individual-specific effects are of considerable interest as they al- 
low one to distinguish between true state dependence and spurious dependence caused 
by unobserved heterogeneity (see Section 22.5.1). 

For nonlinear models it is not always obvious how to include lagged dependent
variables as regressors, since for some types of data there is not always a standard pure
time series model. This is illustrated in Section 23.7.4 for the Poisson model. Once an
appropriate specification is determined, the standard fixed effects estimators become
inconsistent and random effects estimators need to incorporate initial conditions, as
was the case for the linear panel model.


Pooled Models 


The pooled model ignores random effects and estimates the usual cross-section model 
where the regressors now include lagged dependent variables. The discussion in Sec- 
tion 23.2.4 is again relevant. 


Fixed Effects Models 


For fixed effects models the issues are similar to those presented in Section 22.5. The 
regressors are now weakly exogenous rather than strongly exogenous. The usual fixed 
effects estimators are inconsistent. 

For models with additive effects or multiplicative effects consistent estimation is
possible using the first-difference transformation (see Section 23.2.2) and higher lags
of the lagged dependent variable as an instrument. For additive effects models this
leads to a nonlinear version of the Arellano-Bond estimator given in Section 22.5.3.
For multiplicative effects the first-difference transformation is detailed in Section
23.7.4. For dynamic logit with fixed effects see Section 23.4.3.


Parametric Random Effects Models 


For parametric random effects models initial conditions on the lagged dependent
variable matter. Usually there is no satisfactory treatment, so the estimates are inconsistent
in short panels with inconsistency that declines as $T$ gets larger.

Consider the simplest case where only the first-period lag appears in the model,
so the regressors $\mathbf{x}_{it}$ become regressors $\mathbf{x}_{it}$ and $y_{i,t-1}$. The
random effects density (23.1) becomes $f(y_{it} | y_{i,t-1}, \mathbf{x}_{it}, \alpha_i, \delta)$ for
$t = 2, \ldots, T$. However, a similar model for $y_{i1}$ cannot be included because $y_{i0}$
is not observed. One approach treats $y_{i1}$ as exogenous, so that we model the conditional
distribution for only $T - 1$ observations $y_{iT}, \ldots, y_{i2}$. An alternative approach
presumes a static model for $y_{i1}$ that depends on regressors $\mathbf{x}_{i1}$ and possibly
on the individual effect $\alpha_i$. Then the joint conditional density of $\mathbf{y}_i$ is


$$ f(\mathbf{y}_i | \mathbf{x}_{i1}, \ldots, \mathbf{x}_{iT}, \delta, \delta_1, \gamma) = \int \left[ \prod_{t=2}^{T} f(y_{it} | y_{i,t-1}, \mathbf{x}_{it}, \alpha_i, \delta) \right] f_1(y_{i1} | \mathbf{x}_{i1}, \alpha_i, \delta_1) \, g(\alpha_i | \gamma) \, d\alpha_i, $$


rather than (23.18), where $f_1(y_{i1} | \mathbf{x}_{i1}, \alpha_i, \delta_1)$ is the assumed
density for the first observation.

In pure time series analysis initial conditions become irrelevant asymptotically,
since $T \to \infty$. In short panels, however, they become very important as $T$ is small
and asymptotics instead use $N \to \infty$.


23.2.8. Endogenous Regressors 


The treatment for endogenous variables in nonlinear models is similar to that in the 
linear case presented in Chapter 22. 

Panel GMM is the natural framework. The starting point is a conditional moment
restriction $E[\mathbf{u}_i(\theta) | \mathbf{Z}_i] = \mathbf{0}$ for appropriately defined
residual $\mathbf{u}_i(\theta)$ and instruments $\mathbf{Z}_i$. This leads to unconditional
moment condition (23.24) that is the basis for GMM estimation. Possible candidates for
instruments can include exogenous regressors from periods other than the current one, as
detailed in Sections 22.2 and 22.4 for the linear model.


23.3. Nonlinear Panel Example: Patents and R&D 


We model the relationship between patents and R&D expenditures, using U.S. data
on 346 firms for each of the five years 1975-1979 from Hall, Griliches, and Hausman
(1986). The dependent variable $y_{it}$ is Patents, defined as the number of patents applied
for during the year that were eventually granted. For simplicity we consider just one
explanatory variable $x_{it}$, real R&D spending during the year (in 1972 dollars).

Figure 23.1: Patents and R&D spending: pooled (overall) regression. Natural logarithm of
patent applications leading to award plotted against the natural logarithm of R&D spending
for 346 firms in each of the five years 1975-79. Zero patents recoded to 0.5 patents.


An obvious starting model is a log-log model, with
$E[\ln y_{it} | x_{it}] = \alpha_i + \beta \ln x_{it}$, since then $\beta$ equals the
Patents-R&D elasticity. This model cannot be applied here, as $y_{it} = 0$ for a
considerable number of observations and $\ln 0$ is not defined. An ad hoc
adjustment is to recode $y_{it} = 0$ as $y_{it} = 0.5$ before taking logs.

Figure 23.1 provides a plot of the adjusted In (Patents) against In (R&D), along with 
fitted OLS (with an estimated slope coefficient of 0.834) and nonparametric regression 
curves, using data for all firms in all years. Patents clearly increase with R&D expen- 
diture. Panel data analysis, particularly fixed effects models, can separate this rela- 
tionship into cross-section and time-series components. Note that Patents vary greatly 
across observations, particularly across firms, with a mean of 36.3, a standard deviation 
of 74.5, and a range of 0 to 608 over all years and firms. 

We estimate a multiplicative individual-effects model for the conditional mean with


$$ E[y_{it} | x_{it}, \alpha_i] = \alpha_i \exp(\beta \ln x_{it}) = \exp(\gamma_i + \beta \ln x_{it}), \tag{23.31} $$


where $\gamma_i = \ln \alpha_i$. Then $\beta$ directly estimates the Patents-R&D elasticity,
since (23.31) implies $\partial \ln E[y_{it} | x_{it}] / \partial \ln x_{it} = \beta$. Unlike the
log-log model, zero values for $y_{it}$ cause no problems.

A richer parametric model recognizes that the dependent variable is a count. A
starting point is a Poisson model


$$ y_{it} | x_{it}, \gamma_i \sim \mathcal{P}[\exp(\gamma_i + \beta \ln x_{it})]. \tag{23.32} $$


This model, detailed in Section 23.7, has the same conditional mean for $y_{it}$ as that
given in (23.31).

Table 23.1 presents a number of estimators for these data. All estimators are
consistent under the assumption that the conditional mean is given by (23.31) with
$\alpha_i$ a random effect that is independent of $x_{it}$ and has constant mean. All
estimators except the last are inconsistent under the assumption that $\alpha_i$ is instead
a fixed effect that is correlated with $x_{it}$. Three standard error estimates are
provided: program defaults, panel-robust estimates (where available), and bootstrap
estimates (without refinement).


Table 23.1. Patents and R&D Spending: Nonlinear Panel Model Estimators (a)

                        NLS      Poisson   GEE      Poisson-RE   Poisson-FE
$\gamma = \ln \alpha$   2.529    1.712     2.068    2.313        -
$\beta$                 .509     .693      .560     .349         -0.038
Panel se (b)            (.055)   (.043)    (.033)   (.033)       (.033)
Boot se                 [.054]   [.047]    [.107]   [.119]       [.107]
Usual se                {.011}   {.002}    {.004}   {.033}       {.033}
Sum $\beta$             -        .486      .460     .546         .313
N                       1730     1730      1730     1730         1620

(a) Shown are pooled NLS, pooled Poisson, pooled GEE, Poisson Random Effects (RE), and
Poisson Fixed Effects estimates for the nonlinear panel regression (23.31) of ln(Patents)
on ln(R&D). Standard errors for the slope coefficients are panel robust in parentheses,
bootstrap in square brackets, and usual estimates that assume iid errors in curly braces.
The second to last row gives the sum of $\beta$ coefficients in an expanded model with up
to five lags of ln(R&D) as regressors.
(b) se, standard error.


The details for each column are as follows:


Pooled NLS: The NLS estimates in the first column estimate (23.31) with a; = a by 


NLS (see Section 5.8). The default standard error of 0.011 assuming iid errors is 
much smaller than the correct panel-robust standard error estimate of 0.054. 


Pooled Poisson: The Poisson estimates in the second column are for the Poisson 


model (23.32) with a; = œ estimated by the Poisson MLE assuming indepen- 
dence over i and t. The estimated elasticity is 0.693 compared to the NLS esti- 
mate of 0.509. The default standard error of 0.002 imposes the Poisson restric- 
tion of variance—mean equality (see Section 20.2.2). Correcting for overdispersion 
using the sandwich variance matrix estimate (see also Section 20.2.2) increases 
the standard error estimate to 0.020 and emphasizes the importance of control- 
ling for any overdispersion in count data. Additionally controlling for correlation 
over t for given i leads to an even higher panel-robust standard error estimate 
of 0.043. 


Pooled GEE: The pooled GEE estimator solves (23.30), where g(x;;, B) is given by 


(23.32) with a; = a. The particular specification of the working matrix X; used 
here is given after (23.55). The estimated elasticity is 0.560 with standard error of 
0.033 using the panel-robust estimate discussed after (23.30). 


Poisson—RE: The Poisson random effects estimator assumes that a; = In y; is gamma 


distributed (see Section 23.7.2). The estimated elasticity is 0.349 with default stan- 
dard error of 0.033. 


Poisson-FE: The Poisson fixed effects estimator assumes that $\alpha_i = \ln \gamma_i$ is a fixed
effect, and it is estimated as in Section 23.7.3. The estimated elasticity of -0.038
is now negative, with default standard error of 0.033. For the Poisson fixed effects
model, firms with $\sum_t y_{it} = 0$ are dropped, leading here to a loss of $22 \times 5 = 110$
observations.
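
The ordering of the three pooled Poisson standard errors (default, overdispersion-robust, panel-robust) can be reproduced on simulated data. The following Python sketch is illustrative only: it does not use the patents data of Table 23.1, and the function name `pooled_poisson`, the simulation design, and all parameter values are assumptions made for the example.

```python
import numpy as np

def pooled_poisson(y, X, ids, n_iter=50):
    """Pooled Poisson MLE with default, robust, and panel-robust standard errors.

    y: (n,) counts; X: (n, k) regressors with the constant in column 0;
    ids: (n,) labels of the cross-section unit of each observation.
    """
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())               # start at the constant-only fit
    for _ in range(n_iter):                  # Newton-Raphson on the Poisson score
        lam = np.exp(X @ beta)
        score = X.T @ (y - lam)
        A = (X * lam[:, None]).T @ X         # sum of lam_it * x_it x_it'
        step = np.linalg.solve(A, score)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    lam = np.exp(X @ beta)
    u = y - lam
    A_inv = np.linalg.inv((X * lam[:, None]).T @ X)
    # Robust middle matrix: allows overdispersion, assumes independence over t.
    B_robust = (X * (u ** 2)[:, None]).T @ X
    # Panel-robust middle matrix: sum the scores within each unit first.
    B_panel = np.zeros_like(A_inv)
    for g in np.unique(ids):
        s = X[ids == g].T @ u[ids == g]
        B_panel += np.outer(s, s)
    se = {
        "default": np.sqrt(np.diag(A_inv)),                    # variance = mean
        "robust": np.sqrt(np.diag(A_inv @ B_robust @ A_inv)),
        "panel_robust": np.sqrt(np.diag(A_inv @ B_panel @ A_inv)),
    }
    return beta, se

# Simulated short panel with a multiplicative gamma effect (mean 1) that is
# independent of the regressor, so pooled Poisson remains consistent.
rng = np.random.default_rng(0)
N, T = 1000, 5
ids = np.repeat(np.arange(N), T)
alpha = rng.gamma(shape=2.0, scale=0.5, size=N)
x = np.repeat(rng.normal(size=N), T) + 0.5 * rng.normal(size=N * T)
X = np.column_stack([np.ones(N * T), x])
y = rng.poisson(np.repeat(alpha, T) * np.exp(1.0 + 0.5 * x))

beta_hat, se = pooled_poisson(y, X, ids)
print(beta_hat, se)
```

In this design the individual effect induces both overdispersion and correlation over t, so the default standard error should be the smallest and the panel-robust one the largest, mirroring the pattern reported for the pooled Poisson column.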

There is a big difference between fixed and random effects results, favoring fixed
effects estimation. The surprising negative estimated elasticity with FE arises because
the model is too simple. In particular, R&D expenditure affects patent activity with
a lag. Replacing $\beta \ln x_{it}$ in (23.31) and (23.32) by $\sum_{l=0}^{5} \beta_l \ln x_{i,t-l}$ leads to estimated
elasticity $\sum_{l=0}^{5} \beta_l$ given in the second last row of Table 23.1. The FE estimate of 0.313
is still less than the other estimates, but the difference is now reduced.


23.4. Binary Outcome Data 


We consider a binary outcome in which $y_{it}$ takes only the values 0 and 1. For example,
data may be available on whether or not an individual is employed in each of several 
time periods. A key result is that fixed effects estimation is possible for the logit model 
but not the probit model. 


23.4.1. Individual-Specific Effects Binary Models 


The natural extension of the binary outcome model from cross-section data (see Section
14.3) to panel data with individual-specific effects is to specify that $y_{it}$ takes only
the values 0 and 1, with


$$\Pr[y_{it} = 1 | x_{it}, \beta, \alpha_i] = \begin{cases} F(\alpha_i + x_{it}'\beta) & \text{in general,} \\ \Lambda(\alpha_i + x_{it}'\beta) & \text{for logit model,} \\ \Phi(\alpha_i + x_{it}'\beta) & \text{for probit model,} \end{cases} \tag{23.33}$$


where $F(\cdot)$ is a cumulative distribution function, $\Lambda(\cdot)$ is the logistic cdf with $\Lambda(z) =
e^z/(1 + e^z)$, and $\Phi(\cdot)$ is the standard normal cdf. Given (23.33) and assuming conditional
independence, the joint density for the ith observation $y_i = (y_{i1}, \ldots, y_{iT})$ is


$$f(y_i | X_i, \alpha_i, \beta) = \prod_{t=1}^{T} F(\alpha_i + x_{it}'\beta)^{y_{it}} \left(1 - F(\alpha_i + x_{it}'\beta)\right)^{1 - y_{it}}. \tag{23.34}$$


For binary data the conditional probability is also the conditional mean, so

$$E[y_{it} | \alpha_i, x_{it}] = F(\alpha_i + x_{it}'\beta). \tag{23.35}$$


This is a single-index individual-specific effects model (see (23.5)) that does not sim- 
plify to either an additive or multiplicative effects model. Additive and multiplicative 
effects models are not appropriate as they do not restrict the conditional mean and 
conditional probability to lie between zero and one. 

Binary panel models emphasize the parametric model (23.34), since binary data 
must be Bernoulli distributed. The conditional mean model (23.35) is rarely used, 
though it is natural to use this if regressors are endogenous. 


23.4.2. Random Effects Binary Models 


The random effects MLE assumes that the individual effects are normally distributed,
with $\alpha_i \sim N[0, \sigma_\alpha^2]$. The random effects MLE of $\beta$ and $\sigma_\alpha^2$ maximizes the
log-likelihood $\sum_{i=1}^{N} \ln f(y_i | X_i, \beta, \sigma_\alpha^2)$, where


$$f(y_i | X_i, \beta, \sigma_\alpha^2) = \int f(y_i | X_i, \alpha_i, \beta) \frac{1}{\sqrt{2\pi\sigma_\alpha^2}} \exp\left(\frac{-\alpha_i^2}{2\sigma_\alpha^2}\right) d\alpha_i, \tag{23.36}$$


where $f(y_i | X_i, \alpha_i, \beta)$ is given in (23.34) with $F = \Lambda$ for the logit model and $F = \Phi$
for the probit model. There is no closed-form solution for the integral (23.36) and it is
standard to compute it numerically using quadrature methods.
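
One common quadrature choice for an integral against a normal density is Gauss-Hermite quadrature. The sketch below evaluates the random effects logit density for a single individual; the function name `re_logit_density`, the number of nodes, and the example values are assumptions for illustration, not the routine used by any particular package.

```python
import numpy as np
from itertools import product

def re_logit_density(y, X, beta, sigma_a, n_nodes=20):
    """Approximate the integral (23.36) for the logit model by Gauss-Hermite quadrature.

    y: (T,) array of 0/1 outcomes; X: (T, K) regressors; sigma_a: std. dev. of alpha_i.
    """
    # Nodes and weights for integrals of the form int exp(-z^2) g(z) dz.
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    # Change of variable alpha = sqrt(2) * sigma_a * z turns the normal density
    # into exp(-z^2) / sqrt(pi).
    alphas = np.sqrt(2.0) * sigma_a * nodes
    index = alphas[:, None] + (X @ beta)[None, :]          # (n_nodes, T)
    p = 1.0 / (1.0 + np.exp(-index))                       # logistic cdf
    cond = np.prod(np.where(y[None, :] == 1, p, 1.0 - p), axis=1)  # (23.34) per node
    return float(weights @ cond / np.sqrt(np.pi))

# Hypothetical example with T = 3 periods and one regressor.
X = np.array([[0.2], [-0.5], [1.0]])
beta = np.array([0.8])
total = sum(re_logit_density(np.array(s), X, beta, sigma_a=1.3)
            for s in product((0, 1), repeat=3))
print(total)  # the densities of all 2^T sequences should sum to one
```

The log-likelihood is then the sum over individuals of the logarithm of this quantity, maximized jointly over $\beta$ and $\sigma_\alpha$ with a numerical optimizer.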

If fixed effects are not present, then an alternative to the random effects model is 
a pooled binary model that simply specifies that $\Pr[y_{it} = 1 | x_{it}] = F(x_{it}'\beta)$. Statistical
inference should then be based on panel-robust standard errors (see Section 23.2.6). 
More efficient estimation is possible using a GMM approach (see Avery et al., 1983) 
or a GEE approach (see Liang and Zeger, 1986). 


23.4.3. Fixed Effects Logit 


Fixed effects estimation is possible for the panel logit model, using the conditional 
MLE, but not for other binary panel models such as panel probit. 

For the logit model, some algebra given in Section 23.4.6 yields that the
joint density of $y_i = (y_{i1}, \ldots, y_{iT})$ is


$$f(y_i | \alpha_i, X_i, \beta) = \frac{\exp\left(\alpha_i \sum_t y_{it}\right) \exp\left(\left(\sum_t y_{it} x_{it}\right)'\beta\right)}{\prod_t \left[1 + \exp(\alpha_i + x_{it}'\beta)\right]}. \tag{23.37}$$


This depends on $\alpha_i$, which we need to eliminate. For observation i there are $\sum_t y_{it}$
outcomes of 1 in the T periods. Define the set $B_c = \{d_i | \sum_t d_{it} = c\}$ to be
the set of all possible sequences of 0s and 1s for which the sum of T binary outcomes
is c. Then if we condition on $\sum_t y_{it} = c$ it is shown in Section 23.4.6 that $\alpha_i$ is
eliminated and


$$f\left(y_i \,\Big|\, \sum_t y_{it} = c, X_i, \beta\right) = \frac{\exp\left(\left(\sum_t y_{it} x_{it}\right)'\beta\right)}{\sum_{d \in B_c} \exp\left(\left(\sum_t d_{it} x_{it}\right)'\beta\right)}, \tag{23.38}$$


a result due to Chamberlain (1980). The density (23.38) is the basis for conditional
ML estimation. The only complication is that there are many sets $B_c$ and sequences
within these sets, as we now detail.

First, it is not possible to condition on $\sum_t y_{it} = 0$, since this can only occur if all
$y_{it} = 0$, and similarly for $\sum_t y_{it} = T$. This can mean considerable loss of observations
if, for example, most people are employed in all periods.

As an example where conditioning works, suppose T = 2 and $\sum_t y_{it} = 1$. Then either
the sequence {0, 1} or {1, 0} is possible, and the conditional probability in (23.38)
implies that, for example,


$$\Pr[y_{i1} = 0, y_{i2} = 1 | y_{i1} + y_{i2} = 1] = \frac{\exp(x_{i2}'\beta)}{\exp(x_{i1}'\beta) + \exp(x_{i2}'\beta)} = \frac{\exp((x_{i2} - x_{i1})'\beta)}{1 + \exp((x_{i2} - x_{i1})'\beta)}.$$

If T = 3 then we can condition on $\sum_t y_{it} = 1$, with possible sequences {0, 0, 1},
{0, 1, 0} and {1, 0, 0}, or on $\sum_t y_{it} = 2$, with possible sequences {0, 1, 1}, {1, 0, 1}
and {1, 1, 0}. Clearly for large T there are many sequences and the conditional density
can get complicated.

The conditional density is that of a conditional logit model, where parameters are 
invariant but regressors vary over alternatives. The number of alternatives varies across 
individuals, where for individual i each alternative is a specific sequence of 0s and 1s
that sum to $\sum_t y_{it}$. It is easiest to use computer code specifically set up for this problem.
Even then there can be a large number of alternatives. For example, if T = 10 and
$\sum_t y_{it} = 5$ then there are 252 alternatives. Consistent but less efficient estimation is
possible by dropping some observations, such as for individuals with many alternatives
because of a high $\sum_t y_{it}$, or by reducing the number of time periods.
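
The enumeration of alternatives can be made concrete with a brute-force sketch. The helper names `sequences_with_sum` and `conditional_logit_prob` are hypothetical, and direct enumeration is shown only for clarity; purpose-built software typically uses faster recursions.

```python
import math
from itertools import combinations
import numpy as np

def sequences_with_sum(T, c):
    """All 0/1 sequences of length T containing exactly c ones (the set B_c)."""
    seqs = []
    for ones in combinations(range(T), c):
        d = np.zeros(T, dtype=int)
        d[list(ones)] = 1
        seqs.append(d)
    return seqs

def conditional_logit_prob(y, X, beta):
    """Probability of the observed sequence y given its sum, as in (23.38).

    y: (T,) 0/1 outcomes; X: (T, K) regressors; beta: (K,) coefficients.
    """
    y = np.asarray(y)
    c = int(y.sum())
    num = np.exp(y @ X @ beta)
    den = sum(np.exp(d @ X @ beta) for d in sequences_with_sum(len(y), c))
    return float(num / den)

# Number of alternatives for T = 10 and five 1s: the binomial coefficient C(10, 5).
print(len(sequences_with_sum(10, 5)), math.comb(10, 5))

# T = 2 check with made-up regressor values: the result is a logit in x_i2 - x_i1.
X = np.array([[0.3], [1.1]])
beta = np.array([0.7])
print(conditional_logit_prob([0, 1], X, beta))
```

The individual effect never enters `conditional_logit_prob`, which is the point of conditioning on the sum. The conditional log-likelihood sums the logarithm of this probability over individuals whose outcomes are not all 0 or all 1.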

The elimination of the individual effects $\alpha_i$ makes it impossible to interpret regression
coefficients using the original model (23.37). Instead, we use the conditional
model (23.38). For example, suppose we have a single regressor and $\beta = 0.2$. Then if
we consider two time periods and condition on $\sum_t y_{it} = 1$, then


$$\Pr[y_{i1} = 0, y_{i2} = 1 | y_{i1} + y_{i2} = 1] = \frac{\exp(\beta(x_{i2} - x_{i1}))}{1 + \exp(\beta(x_{i2} - x_{i1}))}.$$


It follows that a one-unit difference in $x_{i2}$ versus $x_{i1}$ leads to a conditional probability
of this sequence being $\exp(\beta)/[1 + \exp(\beta)]$ compared to a probability of one-half if
$x_{i1} = x_{i2}$.
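
As a worked illustration of this interpretation, evaluating the two-period conditional probability at the coefficient value 0.2 and a one-unit gap $x_{i2} - x_{i1} = 1$ gives

$$\Pr[y_{i1} = 0, y_{i2} = 1 | y_{i1} + y_{i2} = 1] = \frac{e^{0.2}}{1 + e^{0.2}} = \frac{1.2214}{2.2214} \approx 0.550,$$

so the sequence {0, 1} is only about five percentage points more likely than the sequence {1, 0}, relative to the benchmark of 0.5 when the regressor does not change.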


23.4.4. Dynamic Binary Models 


Suppose we have a pure time series first-order Markov logit model with no regressors
other than the lagged dependent variable:


$$\Pr[y_{it} = 1 | \alpha_i, y_{i,t-1}] = \frac{\exp(\alpha_i + \gamma y_{i,t-1})}{1 + \exp(\alpha_i + \gamma y_{i,t-1})}. \tag{23.39}$$

Then performing some algebra given in Section 23.4.6 gives

$$f\left(y_i \,\Big|\, y_{i1}, y_{iT}, \sum_{t=2}^{T-1} y_{it}, \gamma\right) = \frac{\exp\left(\gamma \sum_{t=2}^{T} y_{it} y_{i,t-1}\right)}{\sum_{d \in C_i} \exp\left(\gamma \sum_{t=2}^{T} d_{it} d_{i,t-1}\right)}, \tag{23.40}$$


where the set $C_i = \{d_i | d_{i1} = y_{i1}, d_{iT} = y_{iT}, \sum_t d_{it} = \sum_t y_{it}\}$ is the set of all possible sequences
of 0s and 1s for which the sum of T binary outcomes is $\sum_t y_{it}$, the first outcome is
$y_{i1}$, and the last outcome is $y_{iT}$.

Conditional ML estimation based on (23.40) leads to a consistent estimate of $\gamma$.
The minimum number of time periods needed is four. For example, if $y_i$ is the sequence
{0, 1, 0, 1} then the set $C_i$ is composed of the sequences {0, 1, 0, 1} and
{0, 0, 1, 1}. The approach is due to Chamberlain (1985), who actually considered
a second-order Markov model. Chay, Hoynes, and Hyslop (2001) apply this method
to California administrative data on welfare spells and find that, after controlling for
unobserved individual heterogeneity, there remains true state dependence in welfare
participation.
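
The requirement of at least four periods can be checked by listing the conditioning set directly. The sketch below uses a hypothetical helper `dynamic_logit_set`; it enumerates all sequences with the same first outcome, last outcome, and total as the observed one.

```python
from itertools import product

def dynamic_logit_set(y):
    """Sequences sharing y's first value, last value, and sum (the set C_i)."""
    T, total = len(y), sum(y)
    return [d for d in product((0, 1), repeat=T)
            if d[0] == y[0] and d[-1] == y[-1] and sum(d) == total]

def transitions(d):
    """Number of periods in which a 1 follows a 1, i.e. sum_t d_t * d_{t-1}."""
    return sum(a * b for a, b in zip(d[1:], d[:-1]))

# T = 4: the observed sequence {0, 1, 0, 1} has one competitor in its set.
for d in dynamic_logit_set((0, 1, 0, 1)):
    print(d, transitions(d))

# T = 3: every set is a singleton, so the conditional likelihood is uninformative.
print([len(dynamic_logit_set(y)) for y in product((0, 1), repeat=3)])
```

With four periods the two members of the set differ in the count of consecutive 1s, which is what allows the state-dependence parameter to be estimated.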

The preceding results and discussion apply to pure time series models. Honoré and
Kyriazidou (2000) provided a method that allows regressors other than the lagged
dependent variable. Thus let (23.39) become


$$\Pr[y_{it} = 1 | \alpha_i, y_{i,t-1}, x_{it}] = \frac{\exp(\alpha_i + x_{it}'\beta + \gamma y_{i,t-1})}{1 + \exp(\alpha_i + x_{it}'\beta + \gamma y_{i,t-1})}. \tag{23.41}$$

Consider four time periods and consider sequences with common binary outcomes in
the first and fourth periods, say $d_1$ and $d_4$. Then the probability that the sequence is
$\{d_1, 0, 1, d_4\}$, given that it is either $\{d_1, 0, 1, d_4\}$ or $\{d_1, 1, 0, d_4\}$, now depends on $\alpha_i$.
However, the dependence on $\alpha_i$ disappears if $x_{i3} = x_{i4}$. Since few observations have
$x_{i3} = x_{i4}$, especially with continuous data, Honoré and Kyriazidou (2000) propose
use of kernel smoothing methods with kernel weights that depend on $(x_{i3} - x_{i4})$. Chay
and Hyslop (2000) provide an application that implements this method and many other
methods for dynamic binary data models.


23.4.5. Multinomial Models 


The fixed effects estimator can be generalized to the multinomial logit model, since 
this model implies a binary logit model for pairwise comparison of alternatives (see 
Section 15.4.3). For static models Chamberlain (1980) provides a brief exposition and 
M.-J. Lee (2002) provides more details. Magnac (2000) provides a quite detailed em- 
pirical application to individual transitions among six different states in the French 
labor market using dynamic fixed effects logit models with no regressors other than 
lagged dependent variables. Honoré and Kyriazidou (2000) consider the multinomial 
logit model. 

For other multinomial models a random effects approach is necessary. These mod- 
els, such as mixed logit and multinomial probit, are complicated to estimate even in 
the cross-section case. For details see Train (2003). 


23.4.6. Derivations for Fixed Effects Logit 


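For simplicity suppress the subscript i. For the logit model the joint probability of
$y = (y_1, \ldots, y_T)$ given in (23.34) becomes


$$\begin{aligned} f(y) &= \prod_t \left(\frac{\exp(\alpha + x_t'\beta)}{1 + \exp(\alpha + x_t'\beta)}\right)^{y_t} \left(\frac{1}{1 + \exp(\alpha + x_t'\beta)}\right)^{1 - y_t} \\ &= \frac{\exp\left(\sum_t y_t(\alpha + x_t'\beta)\right)}{\prod_t \left[1 + \exp(\alpha + x_t'\beta)\right]} \\ &= \frac{\exp\left(\alpha \sum_t y_t\right) \exp\left(\left(\sum_t y_t x_t\right)'\beta\right)}{\prod_t \left[1 + \exp(\alpha + x_t'\beta)\right]}, \end{aligned} \tag{23.42}$$


which yields (23.37).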

The quantity $\sum_t y_t$ can be shown to be a sufficient statistic for $\alpha$ as follows. Suppose
we have an observation for y such that $\sum_t y_t = c$. Define the set $B_c = \{d | \sum_t d_t = c\}$
to be the set of all possible sequences of 0s and 1s for which the sum of T binary
outcomes is c, and condition on $\sum_t y_t = c$. Then


$$\begin{aligned} f\left(y \,\Big|\, \sum_t y_t = c\right) &= \frac{\Pr\left[y, \sum_t y_t = c\right]}{\Pr\left[\sum_t y_t = c\right]} \\ &= \frac{\Pr[y]}{\Pr\left[\sum_t y_t = c\right]} \\ &= \frac{\Pr[y]}{\sum_{d \in B_c} \Pr[d]} \\ &= \frac{\exp\left(\left(\sum_t y_t x_t\right)'\beta\right)}{\sum_{d \in B_c} \exp\left(\left(\sum_t d_t x_t\right)'\beta\right)}, \end{aligned} \tag{23.43}$$


where the first equality uses Bayes' rule, the second equality uses the fact that knowledge
of $\sum_t y_t$ does not add anything given knowledge of y, the third equality uses
the fact that $\Pr[\sum_t y_t = c]$ equals the sum of the probabilities of combinations of 0s
and 1s that sum to c, and the fourth uses the previous definition of f(y) and considerable
simplification that in part relies on $\sum_t y_t = \sum_t d_t$ when we restrict attention to
$d \in B_c$.

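Now consider the dynamic model. Replacing $x_t'\beta$ in (23.42) by $\gamma y_{t-1}$ yields


$$\begin{aligned} f(y) &= \frac{\exp\left(\sum_{t=2}^{T} y_t(\alpha + \gamma y_{t-1})\right)}{\prod_{t=2}^{T} \left[1 + \exp(\alpha + \gamma y_{t-1})\right]} \\ &= \frac{\exp\left(\alpha \sum_{t=2}^{T} y_t\right) \exp\left(\sum_{t=2}^{T} \gamma y_{t-1} y_t\right)}{\left[1 + \exp(\alpha)\right]^{\sum_{t=2}^{T}(1 - y_{t-1})} \left[1 + \exp(\alpha + \gamma)\right]^{\sum_{t=2}^{T} y_{t-1}}} \\ &= \frac{\exp\left(\alpha \sum_{t=2}^{T} y_t\right) \exp\left(\sum_{t=2}^{T} \gamma y_{t-1} y_t\right)}{\left[1 + \exp(\alpha)\right]^{T - 1 - y_1 + y_T - \sum_{t=2}^{T} y_t} \left[1 + \exp(\alpha + \gamma)\right]^{y_1 - y_T + \sum_{t=2}^{T} y_t}}, \end{aligned}$$


where the second equality uses the fact that $y_{t-1}$ is either 0 or 1 and follows after
some algebra, and the last equality uses $\sum_{t=2}^{T} y_{t-1} = y_1 - y_T + \sum_{t=2}^{T} y_t$. The algebra
is then similar to (23.43) except that in addition to conditioning on $\sum_{t=2}^{T} y_t$ we also
need to condition on $y_1$ and $y_T$ that appear in the denominator. Equivalently, we can
condition on $\sum_{t=2}^{T-1} y_t$ and $y_1$ and $y_T$. This yields


$$f\left(y \,\Big|\, y_1, y_T, \sum_{t=2}^{T-1} y_t\right) = \frac{\exp\left(\sum_{t=2}^{T} \gamma y_{t-1} y_t\right)}{\sum_{d \in C} \exp\left(\sum_{t=2}^{T} \gamma d_{t-1} d_t\right)},$$


where $C = \{d | d_1 = y_1, d_T = y_T, \sum_t d_t = \sum_t y_t\}$ is the set of all possible sequences
of 0s and 1s for which the sum of the T binary outcomes is $\sum_t y_t$, the first
outcome is $y_1$, and the last outcome is $y_T$.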


23.5. Tobit and Selection Models


We consider censoring, truncation, or selection when panel data are available, rather 
than data on a single cross-section. 

A pooled analysis simply mirrors analysis in the cross-section case, with the adjust- 
ment that panel-robust standard errors should be computed (see Section 23.2.8). For 
example, see Grasdal (2001) who considers selection resulting from panel attrition. 

Here we focus instead on panel models with individual-specific effects. Then ran- 
dom effects models can be estimated, if the strong assumption of a purely random ef- 
fect is warranted, the only complication being that of numerical computation. There are 
no simple consistent estimators for fixed effects models, however, in the usual microe- 
conometric setting of a short panel. More complicated semiparametric estimators that 
permit fixed effects in Tobit and generalized Tobit models are given in Section 23.8. 


23.5.1. Censored and Truncated Models 


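For cross-section data the censored Tobit model is given in Section 16.3.1. A panel
version with additive individual-specific effect specifies


$$y_{it}^* = \alpha_i + x_{it}'\beta + \varepsilon_{it}, \tag{23.44}$$

where $\varepsilon_{it} \sim N[0, \sigma_\varepsilon^2]$, and we observe $y_{it} = y_{it}^*$ if $y_{it}^* > 0$ and $y_{it} = 0$ or is observed
to be missing if $y_{it}^* \le 0$. The joint density for the ith observation $y_i = (y_{i1}, \ldots, y_{iT})$
can be written as

$$f(y_i | X_i, \alpha_i, \beta, \sigma_\varepsilon^2) = \prod_{t=1}^{T} \left[\frac{1}{\sigma_\varepsilon} \phi_{it}\right]^{d_{it}} \left[1 - \Phi_{it}\right]^{1 - d_{it}}, \tag{23.45}$$


where $d_{it} = 1$ if $y_{it}$ is uncensored and $d_{it} = 0$ otherwise, $\phi_{it} = \phi((y_{it} - \alpha_i - x_{it}'\beta)/\sigma_\varepsilon)$,
$\Phi_{it} = \Phi((\alpha_i + x_{it}'\beta)/\sigma_\varepsilon)$, and $\phi(\cdot)$ and $\Phi(\cdot)$ denote, respectively, the standard
normal pdf and cdf.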

The fixed effects MLE maximizes the log-likelihood based on (23.45) with respect
to $\beta$, $\sigma_\varepsilon^2$, and $\alpha_1, \ldots, \alpha_N$. In short panels the resulting estimator of $\beta$ is inconsistent
because of the incidental parameters problem, and there is no simple differencing or
conditioning method that can provide a consistent estimator. Heckman and MaCurdy 
(1980) applied the fixed effects MLE to female labor supply. Although recognizing the 
inconsistency of the estimator, they argued that with T = 8 the inconsistency may not 
be too great. Greene (2004a) provides a recent Monte Carlo study for the fixed effects 
Tobit MLE. 

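Random effects estimation is more commonly used because of inconsistency of the
fixed effects MLE. Under the assumption that $\alpha_i \sim N[0, \sigma_\alpha^2]$ the random effects
MLE of $\beta$, $\sigma_\varepsilon^2$, and $\sigma_\alpha^2$ maximizes the log-likelihood $\sum_{i=1}^{N} \ln f(y_i | X_i, \beta, \sigma_\varepsilon^2, \sigma_\alpha^2)$,
where


$$f(y_i | X_i, \beta, \sigma_\varepsilon^2, \sigma_\alpha^2) = \int f(y_i | X_i, \alpha_i, \beta, \sigma_\varepsilon^2) \frac{1}{\sqrt{2\pi\sigma_\alpha^2}} \exp\left(\frac{-\alpha_i^2}{2\sigma_\alpha^2}\right) d\alpha_i, \tag{23.46}$$


for $f(y_i | X_i, \alpha_i, \beta, \sigma_\varepsilon^2)$ given in (23.45). This one-dimensional integral can be computed
using Gaussian quadrature.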

This approach can be extended to other models with censoring or truncation. For
example, a right-censored version of the Poisson random effects model in Section 
23.7.2 may be used if, say, counts above 10 are recorded only as 10 or more. 

There are two weaknesses to the fully parametric approach. First, as in the cross- 
section case reliance on distributional assumptions becomes much greater when there 
is censoring or truncation. Second, the assumption of purely random effects indepen- 
dent of regressors may be too strong. 


23.5.2. Selection Models 


Selection models can arise in panel data for the same reasons as in the cross-section
case (see Section 16.5). A generalization of the Tobit type 2 model in Section 16.5.1
to a linear panel model with individual-specific effects $\alpha_i$ and $\delta_i$ is


$$\begin{aligned} y_{it}^* &= \alpha_i + x_{it}'\beta + \varepsilon_{it}, \\ d_{it}^* &= \delta_i + z_{it}'\gamma + v_{it}, \end{aligned} \tag{23.47}$$


where $y_{it} = y_{it}^*$ is observed if $d_{it}^* > 0$ and $y_{it}$ is not observed otherwise.

For the random effects formulation the four unobservables are assumed to be normally
distributed. Hausman and Wise (1979) proposed ML estimation, which involves
a bivariate integral as $\alpha_i$ may be correlated with $\delta_i$ and $\varepsilon_{it}$ may be correlated with $v_{it}$.

The fixed effects estimator is inconsistent in short panels. Note, however, that if
$d_{it}^* = \delta_i$, so that selection is due only to time-invariant characteristics of the individual,
which may be observed or unobserved, then the fixed effects estimator in the model
$y_{it} = \alpha_i + x_{it}'\beta + \varepsilon_{it}$ is consistent. A fixed effects panel model controls for sample
selection, to the extent that it depends on time-invariant characteristics.

Verbeek and Nijman (1992) provide a more detailed discussion of the essential assumptions
needed for consistent estimation in these models and propose tests for selectivity
bias. Wooldridge (1995) provides a similar analysis under weaker assumptions
and presents assumptions that may not be too restrictive in some applications that per- 
mit consistent estimation of the fixed effects model. Vella (1998) provides a review 
and additional references. 

The methods for sample selection can be extended to panel attrition (see Section
21.8.5) that leads to attrition bias if observations on the dependent variable are
lost in a nonrandom manner. Then all data for the (i, t)th observation are not observed
if $d_{it}^* \le 0$, so $z_{it}$ in (23.47) needs to be replaced by variables observed in periods
other than period t. An early example is Hausman and Wise (1979), and a more recent
application is Grasdal (2001). Baltagi (2001) and Hsiao (2003) provide further
references.


23.6. Transition Data 


For concreteness consider panel data on welfare spells. Great interest lies in measuring 
individual persistence in welfare spells, and determining the extent to which this is due 
to true state dependence rather than differences in individual propensities to be on welfare.
Since individual propensities may depend in part on unobservables, models with
individual-specific effects should be used. For duration data there exists an unusually 
wide range of modeling approaches, as several types of panel data on transitions are 
possible. Here we focus on fixed effects models. 

Data may be available on whether or not an individual is in a state at several points 
in time, such as on welfare. Then one can use a binary panel model (see Section 23.4), 
such as the dynamic fixed effects logit model. 

Richer data provide information on the durations of several individual spells. A
natural starting point is then a panel proportional hazards model


$$\lambda(t_{ij} | x_{ij}) = \lambda_j(t_{ij}, \gamma_j) \exp(x_{ij}'\beta) \alpha_i, \tag{23.48}$$


where $t_{ij}$ is the completed spell duration for the jth spell of the ith individual and $\alpha_i$ is
an individual-specific effect. This is the mixed proportional hazards model, discussed
for single-spell data in Chapter 18. The conditions for nonparametric identification of
the MPH model with only single-spell data (see Section 18.3) include the assumption
that $\alpha_i$ are distributed independently of the regressors. This rules out fixed effects.
Once multiple spells become available, however, Honoré (1992) showed that $\alpha_i$ can
be a fixed effect if $x_{ij}$ is constant over j (see Section 19.4.1). For further discussion of
the model (23.48), including a dynamic duration model with hazard function for the
second spell dependent on the duration of the first spell, see Section 19.4.1.

Chamberlain (1985) presented several approaches for elimination of $\alpha_i$ in various
panel duration models. For the MPH model, with baseline hazard $\lambda_j(\cdot)$ the same across
spells j, the probability that the second spell is longer than the first spell does not
depend on $\alpha_i$. Conditional ML can be applied to the gamma duration model, since the
gamma is an LEF density. For Weibull, gamma and log-normal models the density of
$t_{i1}/t_{i2}$ does not depend on $\alpha_i$.

For more recent references and a detailed discussion, including sensitivity of 
multiple-spell data to censoring, see Van den Berg (2001). 


23.7. Count Data 


Hausman et al. (1984) presented estimable fixed effects and random effects models for 
both panel Poisson and panel negative binomial models. More recent work has empha- 
sized fixed effects in multiplicative effects models, permitting estimation of static and 
dynamic models under relatively weak distributional assumptions. 


23.7.1. Individual-Specific Effects Count Models 


We focus on the Poisson model, detailed for cross-section data in Section 20.2, though 
panel versions of negative binomial are also briefly considered. 

The Poisson individual-specific effects model specifies that $y_{it} \sim P[\alpha_i \exp(x_{it}'\beta)]$.
Then, assuming conditional independence, the joint density for the ith observation
$y_i = (y_{i1}, \ldots, y_{iT})$ is

$$\Pr[y_i | X_i, \alpha_i, \beta] = \prod_{t=1}^{T} \exp\left(-\alpha_i \exp(x_{it}'\beta)\right) \left(\alpha_i \exp(x_{it}'\beta)\right)^{y_{it}} / y_{it}!. \tag{23.49}$$

A less parametric approach simply models the conditional mean as

$$E[y_{it} | \alpha_i, x_{it}] = \alpha_i \exp(x_{it}'\beta) = \exp(\gamma_i + x_{it}'\beta), \tag{23.50}$$

where $\gamma_i = \ln \alpha_i$.


This is both a single-index individual-specific effects model and a multiplicative effects
model. Since it is a multiplicative effects model the individual effects $\alpha_i$ can
be eliminated by mean differencing or first differencing. Note that the Poisson panel
model (23.49) has conditional mean (23.50).


23.7.2. Random Effects Count Models 


Assuming gamma-distributed random effects leads to a tractable solution for the
marginal density of the random effects model. Assume $\alpha_i$ is $G[\eta, \eta]$ distributed with
mean 1, variance $1/\eta$, and density $g(\alpha_i | \eta) = \eta^\eta \alpha_i^{\eta - 1} e^{-\alpha_i \eta} / \Gamma(\eta)$. Then (23.18) for the
Poisson model (23.49) becomes


$$f(y_i | X_i, \beta, \eta) = \left[\prod_t \frac{\lambda_{it}^{y_{it}}}{y_{it}!}\right] \times \left(\frac{\eta}{\sum_t \lambda_{it} + \eta}\right)^{\eta} \times \left(\sum_t \lambda_{it} + \eta\right)^{-\sum_t y_{it}} \frac{\Gamma\left(\sum_t y_{it} + \eta\right)}{\Gamma(\eta)}, \tag{23.51}$$


where $\lambda_{it} = \exp(x_{it}'\beta)$ and derivations are given in Section 23.7.5. The resulting first-order
conditions for the Poisson random effects estimator $\hat{\beta}$ can be expressed as


$$\sum_{i=1}^{N} \sum_{t=1}^{T} x_{it} \left(y_{it} - \lambda_{it} \frac{\bar{y}_i + \eta/T}{\bar{\lambda}_i + \eta/T}\right) = 0, \tag{23.52}$$


where $\bar{\lambda}_i = T^{-1} \sum_t \exp(x_{it}'\beta)$.

The term on the left-hand side of (23.52) has expected value zero if the mean conditional
on regressors in all periods is $E[y_{it} | \alpha_i, x_{i1}, \ldots, x_{iT}] = \alpha_i \exp(x_{it}'\beta)$. So despite
all the parametric assumptions made, the Poisson random effects estimator is consistent
for $\beta$ under the relatively weak assumption that the conditional mean is that
given in (23.50) and that regressors are strongly exogenous. For the density (23.51),
$E[y_{it} | x_i] = \lambda_{it}$ and $V[y_{it} | x_i] = \lambda_{it} + \lambda_{it}^2/\eta$, so that overdispersion is of the NB2 form.
A sandwich estimate of the variance matrix will permit more flexible models of
overdispersion and conditional correlation. The first-order conditions for $\eta$ (not given)
are quite complicated though the information matrix is block diagonal in $\beta$ and $\eta$.
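
The NB2 form of the overdispersion follows from the law of total variance. As a short worked step, writing $\lambda$ for the Poisson mean $\exp(x'\beta)$ and using only that the multiplicative effect $\alpha$ has mean 1 and variance $1/\eta$ independently of the regressors:

$$\begin{aligned} E[y | x] &= E_\alpha\left[E[y | \alpha, x]\right] = E_\alpha[\alpha \lambda] = \lambda, \\ V[y | x] &= E_\alpha\left[V[y | \alpha, x]\right] + V_\alpha\left[E[y | \alpha, x]\right] = E_\alpha[\alpha \lambda] + V_\alpha[\alpha \lambda] = \lambda + \lambda^2/\eta. \end{aligned}$$

The variance is therefore quadratic in the mean, and it collapses to the Poisson variance as $\eta \to \infty$, that is, as the random effect becomes degenerate at 1.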

Several alternative estimators are available given random effects. First, the pooled
Poisson estimator ignores the random effects and assumes $y_{it} | x_{it} \sim P[\exp(x_{it}'\beta)]$. This
has first-order conditions


$$\sum_{i=1}^{N} \sum_{t=1}^{T} x_{it} (y_{it} - \lambda_{it}) = 0, \tag{23.53}$$


where $\lambda_{it} = \exp(x_{it}'\beta)$. This estimator is consistent if the conditional mean is (23.50)
with $E[\alpha_i | x_{it}] = 1$. Thus the usual cross-section Poisson MLE is consistent if the true
model is one with multiplicative random effects. However, as illustrated in the Section
23.3 example, panel-robust standard errors should be used. Here (23.26) yields


$$\hat{V}[\hat{\beta}] = \left[\sum_{i,t} \hat{\lambda}_{it} x_{it} x_{it}'\right]^{-1} \left[\sum_{i,t,s} \hat{u}_{it} \hat{u}_{is} x_{it} x_{is}'\right] \left[\sum_{i,t} \hat{\lambda}_{it} x_{it} x_{it}'\right]^{-1}, \tag{23.54}$$

where $\hat{\lambda}_{it} = \exp(x_{it}'\hat{\beta})$, $\hat{u}_{it} = y_{it} - \hat{\lambda}_{it}$, $\sum_{i,t}$ denotes $\sum_{i=1}^{N} \sum_{t=1}^{T}$, and $\sum_{i,t,s}$ denotes
$\sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{s=1}^{T}$. An alternative pooled estimator based on (23.50) is NLS, in which
case (23.53) becomes $\sum_i \sum_t x_{it} \lambda_{it} (y_{it} - \lambda_{it}) = 0$.

Second, more efficient pooled estimation may be possible using the GEE approach
of Section 23.2.8, which introduces conditional correlation. The general result (23.30)
for $g_{it} = \lambda_{it} = \exp(x_{it}'\beta)$ becomes


$$\sum_{i=1}^{N} Z_i' \Sigma_i^{-1} (y_i - \lambda_i) = 0, \tag{23.55}$$

where $Z_i$ is a $T \times K$ matrix with tth row $\lambda_{it} x_{it}'$, and $\lambda_i$ is a $T \times 1$ vector
with tth entry $\lambda_{it}$. Several different working variance matrices $\Sigma_i$ for $V[y_i | X_i]$ are
possible. The choice $\Sigma_i = \mathrm{Diag}[\lambda_{it}]$ yields the pooled Poisson estimating equations in
(23.53). Letting $\Sigma_{i,tt} = \lambda_{it}$ and $\Sigma_{i,ts} = \Sigma_{i,st} = \rho \sqrt{\lambda_{it} \lambda_{is}}$ for $s \neq t$ permits correlation
over t that is equicorrelated or exchangeable since the correlation is a constant $\rho$
for $s \neq t$.

Third, more efficient pooled estimation may be possible using ML with the negative
binomial rather than the Poisson as the starting point. Suppose $y_{it}$ is iid negative
binomial with NB2 variance function with parameters $\alpha_i \lambda_{it}$ and $\phi_i$ (see Section
20.4.1), implying $y_{it}$ has mean $\alpha_i \lambda_{it} / \phi_i$ and variance $(\alpha_i \lambda_{it} / \phi_i) \times (1 + \alpha_i / \phi_i)$.
If $(1 + \alpha_i / \phi_i)^{-1}$ is a beta-distributed random variable with parameters $(\eta_1, \eta_2)$,
then after some considerable algebra (23.18) reduces to


$$f(y_i | X_i, \beta, \eta_1, \eta_2) = \left(\prod_t \frac{\Gamma(\lambda_{it} + y_{it})}{\Gamma(\lambda_{it}) \Gamma(y_{it} + 1)}\right) \times \frac{\Gamma(\eta_1 + \eta_2) \Gamma\left(\eta_1 + \sum_t \lambda_{it}\right) \Gamma\left(\eta_2 + \sum_t y_{it}\right)}{\Gamma(\eta_1) \Gamma(\eta_2) \Gamma\left(\eta_1 + \eta_2 + \sum_t \lambda_{it} + \sum_t y_{it}\right)}, \tag{23.56}$$


where $\lambda_{it} = \exp(x_{it}'\beta)$. This is the basis for ML estimation of $\beta$, $\eta_1$, and $\eta_2$. This
model relies on stronger assumptions than does the Poisson random effects model.

Fourth, analysis need not be restricted to parametric models with closed-form solutions
for $f(y_i | X_i, \beta, \eta)$. Crépon and Duguet (1997a) use maximum simulated likelihood
methods to estimate hurdle and zero-inflated panel count models with joint
normal random effects.


23.7.3. Fixed Effects Count Models


The fixed effects estimator for the Poisson panel model (23.50) can be derived in sev- 
eral ways. 

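First, the Poisson MLE simultaneously estimates $\beta$ and $\alpha_1, \ldots, \alpha_N$. The log-likelihood
based on (23.49) is


$$\begin{aligned} \ln L(\beta, \alpha) &= \ln \left[\prod_i \prod_t \left\{\exp(-\alpha_i \lambda_{it}) (\alpha_i \lambda_{it})^{y_{it}} / y_{it}!\right\}\right] \\ &= \sum_i \left[-\alpha_i \sum_t \lambda_{it} + \ln \alpha_i \sum_t y_{it} + \sum_t y_{it} \ln \lambda_{it} - \sum_t \ln y_{it}!\right], \end{aligned} \tag{23.57}$$


where $\lambda_{it} = \exp(x_{it}'\beta)$. Differentiating with respect to $\alpha_i$ and setting to zero yields
$\hat{\alpha}_i = \sum_t y_{it} / \sum_t \lambda_{it}$. Substituting this back into (23.57) yields the concentrated likelihood
function. Dropping terms not involving $\beta$, we get


$$\ln L_{\text{conc}}(\beta) \propto \sum_i \sum_t \left[y_{it} \ln \lambda_{it} - y_{it} \ln \left(\sum_s \lambda_{is}\right)\right]. \tag{23.58}$$


It follows that for the Poisson fixed effects model there is no incidental parameters
problem. Consistent estimates of $\beta$ for fixed T and $N \to \infty$ can be obtained by maximization
of $\ln L_{\text{conc}}(\beta)$ in (23.58). Differentiation of (23.58) with respect to $\beta$ yields
first-order conditions


$$\sum_i \sum_t y_{it} \left[x_{it} - \frac{\sum_s \lambda_{is} x_{is}}{\sum_s \lambda_{is}}\right] = 0,$$


which can be reexpressed as


$$\sum_{i=1}^{N} \sum_{t=1}^{T} x_{it} \left(y_{it} - \frac{\lambda_{it}}{\bar{\lambda}_i} \bar{y}_i\right) = 0, \tag{23.59}$$


where $\lambda_{it} = \exp(x_{it}'\beta)$ and $\bar{\lambda}_i = T^{-1} \sum_t \exp(x_{it}'\beta)$; see Blundell, Griffith, and
Windmeijer (1995). The Poisson panel model (23.49) and the linear panel model of
Section 21.6 are unusual in that simultaneous estimation of $\beta$ and $\alpha$ provides consistent
estimates of $\beta$ in short panels, so there is no incidental parameters problem.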

Second, the conditional MLE eliminates the fixed effects by conditioning on a sufficient
statistic for $\alpha_i$. For the Poisson panel model this is $\sum_t y_{it}$. Some algebra given in
Section 23.7.5 shows that this leads to a conditional log-likelihood function that is proportional
to the concentrated log-likelihood function given in (23.58). It follows that
the conditional ML estimator for $\beta$ in the fixed effects Poisson model solves (23.59).
This was the original derivation of the Poisson fixed effects estimator of $\beta$ by Palmgren
(1981) and Hausman et al. (1984).

Third, the mean-differenced transformation (23.14) for the multiplicative effects
model (23.50) implies that $E[y_{it} - (\lambda_{it}/\bar{\lambda}_i) \bar{y}_i | x_{i1}, \ldots, x_{iT}] = 0$, and hence


$$E[x_{it} (y_{it} - (\lambda_{it}/\bar{\lambda}_i) \bar{y}_i)] = 0. \tag{23.60}$$

Using the corresponding sample moment conditions leads to an estimator $\hat{\beta}$ that solves
(23.59).

The same estimator has been obtained in three different ways. The third deriva- 
tion makes it clear that the essential assumption for the consistency of the Poisson 
fixed effects estimator is that regressors are strongly exogenous and (23.50) is cor- 
rectly specified. Inference should be based on panel-robust standard errors. In partic- 
ular, if the usual default ML or conditional ML output is used, following the first 
two derivations, standard errors may be considerably understated owing to failure 
to control for overdispersion in the count data. The fixed effects estimator leads to
some loss of data, as observations i with $\sum_t y_{it} = 0$ do not contribute to the sum
in (23.59).
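
A minimal sketch of the estimator, assuming a balanced panel stored as arrays, solves the first-order conditions (23.59) by Newton iterations on the concentrated log-likelihood (23.58). The function name `poisson_fe`, the array layout, and the simulated design are assumptions for illustration; standard errors are omitted and would need the panel-robust form.

```python
import numpy as np

def poisson_fe(y, X, n_iter=100, tol=1e-10):
    """Poisson fixed effects estimator of beta for a balanced panel.

    y: (N, T) counts; X: (N, T, K) regressors with no constant, since any
    time-invariant term is absorbed by the individual effect.
    """
    N, T, K = X.shape
    beta = np.zeros(K)
    ysum = y.sum(axis=1)                                # sum_t y_it
    for _ in range(n_iter):
        lam = np.exp(X @ beta)                          # (N, T)
        p = lam / lam.sum(axis=1, keepdims=True)        # lam_it / sum_s lam_is
        xbar = np.einsum("nt,ntk->nk", p, X)            # p-weighted mean of x_it
        # Score: sum_i sum_t y_it (x_it - xbar_i), equivalent to (23.59).
        score = np.einsum("nt,ntk->k", y, X) - ysum @ xbar
        # Negative Hessian: sum_i ysum_i * p-weighted covariance of x_it.
        Xc = X - xbar[:, None, :]
        neg_hess = np.einsum("n,nt,ntk,ntl->kl", ysum, p, Xc, Xc)
        step = np.linalg.solve(neg_hess, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated data in which the individual effect is correlated with the regressor.
rng = np.random.default_rng(1)
N, T = 2000, 5
a = rng.normal(scale=0.5, size=N)
x = a[:, None] + rng.normal(size=(N, T))
alpha = np.exp(a)
y = rng.poisson(alpha[:, None] * np.exp(0.5 * x))

beta_fe = poisson_fe(y, x[:, :, None])
print(beta_fe)  # should be close to the true slope of 0.5
```

Individuals with all-zero counts have zero weight in both the score and the Hessian, so they drop out automatically without any explicit filtering.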

Consistent estimation of $\beta$ in the presence of fixed effects is also possible for a
particular parameterization of the negative binomial model. Hausman et al. (1984) assumed
that $y_{it}$ is iid NB1 with parameters $\alpha_i \lambda_{it}$ and $\phi_i$, where $\lambda_{it} = \exp(x_{it}'\beta)$, so
$y_{it}$ has mean $\alpha_i \lambda_{it} / \phi_i$ and variance $(\alpha_i \lambda_{it} / \phi_i)(1 + \alpha_i / \phi_i)$. The parameters $\alpha_i$ and
$\phi_i$ can only be identified up to the ratio $\alpha_i / \phi_i$, and this ratio drops out of the conditional
joint density for the ith observation, which after considerable algebra can be
shown to be


$$f\left(y_{i1}, \ldots, y_{iT} \,\Big|\, \sum_t y_{it}\right) = \left(\prod_t \frac{\Gamma(\lambda_{it} + y_{it})}{\Gamma(\lambda_{it}) \Gamma(y_{it} + 1)}\right) \times \frac{\Gamma\left(\sum_t \lambda_{it}\right) \Gamma\left(\sum_t y_{it} + 1\right)}{\Gamma\left(\sum_t \lambda_{it} + \sum_t y_{it}\right)}. \tag{23.61}$$


This distribution for integer $\lambda_{it}$ is the negative hypergeometric distribution. The
conditional ML negative binomial fixed effects estimator of $\beta$ maximizes the log-likelihood
function based on (23.61). The Poisson fixed effects model is more commonly
used since it is consistent under much weaker distributional assumptions.


23.7.4. Dynamic Count Models 


There are several ways to bring dynamics into a count data model. Pure time series
models are surveyed in Cameron and Trivedi (1998). For simplicity consider
inclusion of one lagged dependent variable. The obvious model is $E[y_t | y_{t-1}, x_t] =
\exp(\gamma y_{t-1} + x_t'\beta)$, but this can lead to explosive behavior as a result of exponentiation
of $y_{t-1}$. A more stable model may be obtained by instead using $\exp(\gamma \ln y_{t-1} + x_t'\beta)$,
but this then runs into problems when $y_{t-1} = 0$. For this reason an appealing model
is the linear feedback model $E[y_t | y_{t-1}, x_t] = \gamma y_{t-1} + \exp(x_t'\beta)$. The Poisson integer-valued
AR(1) model has this property and in the pure time series case has correlation
function $\mathrm{Cor}[y_t, y_{t-k}] = \gamma^k$, similar to that for the AR(1) model (see Al-Osh and
Alzaid, 1987).
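
The linear feedback property and the geometric correlation function can be illustrated by simulating a Poisson integer-valued AR(1) process built by binomial thinning. In the sketch below the function name `simulate_inar1`, the parameter values, and the absence of regressors (so the innovation mean `mu` is constant) are all simplifying assumptions.

```python
import numpy as np

def simulate_inar1(gamma, mu, n, rng):
    """Poisson INAR(1): y_t = (binomial thinning of y_{t-1}) + Poisson(mu) innovation.

    Each of the y_{t-1} units survives with probability gamma, so that
    E[y_t | y_{t-1}] = gamma * y_{t-1} + mu, a linear feedback model.
    """
    y = np.empty(n, dtype=np.int64)
    y[0] = rng.poisson(mu / (1.0 - gamma))          # draw from the stationary Poisson
    for t in range(1, n):
        survivors = rng.binomial(y[t - 1], gamma)
        y[t] = survivors + rng.poisson(mu)
    return y

rng = np.random.default_rng(0)
y = simulate_inar1(gamma=0.6, mu=2.0, n=100_000, rng=rng)

r1 = np.corrcoef(y[1:], y[:-1])[0, 1]   # lag-1 autocorrelation, roughly gamma
r2 = np.corrcoef(y[2:], y[:-2])[0, 1]   # lag-2 autocorrelation, roughly gamma**2
print(y.mean(), r1, r2)
```

Because thinning only removes units, the simulated counts stay nonnegative integers, which is what distinguishes this construction from a Gaussian AR(1) with the same correlation function.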

Thus Blundell, Griffith, and Windmeijer (1995, 2002) considered the dynamic
fixed effects panel data model with


$$E[y_{it} | \alpha_i, y_{i,t-1}, x_{it}] = \gamma y_{i,t-1} + \alpha_i \exp(x_{it}'\beta).$$

Applying the first-difference transformation (23.15) leads to conditional moment
restrictions


$$E\left[(y_{it} - \gamma y_{i,t-1}) - \frac{\exp(x_{it}'\beta)}{\exp(x_{i,t-1}'\beta)} (y_{i,t-1} - \gamma y_{i,t-2}) \,\Big|\, y_{i1}, \ldots, y_{i,t-2}, x_{i1}, \ldots, x_{it}\right] = 0.$$


These lead to many unconditional moment conditions (see Section 22.5.3 for a similar
discussion for the linear model) that can supply the basis for GMM estimation as in
Section 23.2.6. Crépon and Duguet (1997b), Montalvo (1997), and Blundell, Griffith,
and Van Reenen (1995, 1999) use similar quasi-differencing methods, with application
to the Patents-R&D relationship.

Bockenholt (1999) uses a more parametric model, estimating a Poisson integer- 
valued AR(1) model with unobserved heterogeneity modeled using a finite mixture 
distribution (see Section 18.5). 


23.7.5. Derivations for Random and Fixed Effects Poisson 


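First, consider a random effects Poisson model with gamma-distributed random effects.
For simplicity suppress the subscript i and let $\lambda_t = \exp(x_t'\beta)$. The general formula
(23.18) for the Poisson model (23.49) and random effects density $g(\alpha | \eta)$ yields


$$\begin{aligned} f(y_1, \ldots, y_T) &= \int \left[\prod_t e^{-\alpha \lambda_t} (\alpha \lambda_t)^{y_t} / y_t!\right] g(\alpha | \eta) \, d\alpha \\ &= \left[\prod_t \lambda_t^{y_t} / y_t!\right] \times \int \left(e^{-\alpha \sum_t \lambda_t} \alpha^{\sum_t y_t}\right) g(\alpha | \eta) \, d\alpha. \end{aligned}$$


For $g(\alpha | \eta) = \eta^\eta \alpha^{\eta - 1} e^{-\alpha \eta} / \Gamma(\eta)$, similar algebra to that in Section 20.4.1 yields the
density given in (23.51).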

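Second, derive the conditional density for the Poisson fixed effects model for 
observations in all time periods for a given individual, where for simplicity the individual 
subscript i is dropped. In general the density of $y_1,\ldots,y_T$ given $\sum_t y_t$ is 


$$
\begin{aligned}
f\left(y_1,\ldots,y_T \,\Big|\, \textstyle\sum_t y_t\right)
&= f\left(y_1,\ldots,y_T, \textstyle\sum_t y_t\right) \Big/ f\left(\textstyle\sum_t y_t\right) \\
&= f(y_1,\ldots,y_T) \Big/ f\left(\textstyle\sum_t y_t\right) \\
&= \frac{\prod_t \left(\exp(-\mu_t)\,\mu_t^{y_t}/y_t!\right)}{\exp\left(-\sum_t \mu_t\right)\left(\sum_t \mu_t\right)^{\sum_t y_t} \Big/ \left(\sum_t y_t\right)!} \\
&= \frac{\exp\left(-\sum_t \mu_t\right)\prod_t \mu_t^{y_t} \Big/ \prod_t y_t!}{\exp\left(-\sum_t \mu_t\right)\prod_t \left(\sum_s \mu_s\right)^{y_t} \Big/ \left(\sum_t y_t\right)!} \\
&= \frac{\left(\sum_t y_t\right)!}{\prod_t y_t!} \times \prod_t \left(\frac{\mu_t}{\sum_s \mu_s}\right)^{y_t},
\end{aligned}
$$

where the second equality arises because knowledge of $\sum_t y_t$ adds nothing given 
knowledge of $y_1,\ldots,y_T$, the third equality specializes to $y_t$ independently distributed 
as $P[\mu_t]$ and hence $\sum_t y_t$ is $P[\sum_t \mu_t]$, and the fourth and fifth equalities simplify. 
The conditional density is that of the multinomial for $\sum_t y_t$ trials where the tth of T 
distinct outcomes occurs in any trial with probability $\mu_t/\sum_s \mu_s$. Setting 
$\mu_{it} = \alpha_i \exp(\mathbf{x}_{it}'\boldsymbol{\beta})$ and taking logarithms yields a conditional log-likelihood that is 
proportional to the concentrated log-likelihood given in (23.58). 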
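
The multinomial result can be checked numerically. The following Python sketch is only an illustration under made-up values: the fixed effect `alpha`, the period means, and the counts are hypothetical, and the function names are not taken from any package. It computes the conditional density directly from Poisson densities and compares it with the multinomial density.

```python
from math import exp, factorial, prod

def poisson_pmf(y, mu):
    """Poisson density with mean mu evaluated at the count y."""
    return exp(-mu) * mu ** y / factorial(y)

def conditional_density(ys, mus):
    """f(y_1,...,y_T | sum_t y_t), built directly from Poisson densities."""
    joint = prod(poisson_pmf(y, m) for y, m in zip(ys, mus))
    return joint / poisson_pmf(sum(ys), sum(mus))

def multinomial_pmf(ys, probs):
    """Multinomial density for sum(ys) trials with cell probabilities probs."""
    coef = factorial(sum(ys)) // prod(factorial(y) for y in ys)
    return coef * prod(p ** y for p, y in zip(probs, ys))

base = (0.5, 1.2, 2.0)   # hypothetical values of exp(x_t' beta), T = 3
ys = [1, 0, 3]           # hypothetical counts for one individual

alpha = 2.7              # hypothetical fixed effect
mus = [alpha * b for b in base]
direct = conditional_density(ys, mus)
multi = multinomial_pmf(ys, [m / sum(mus) for m in mus])

# Repeating the calculation with a different fixed effect gives the same
# value, since alpha cancels in the ratios mu_t / sum_s mu_s.
direct_other_alpha = conditional_density(ys, [0.4 * b for b in base])
print(direct, multi, direct_other_alpha)
```

The three printed numbers agree up to rounding, which is why conditioning on the sum removes the fixed effect from the likelihood.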


23.8. Semiparametric Estimation 


The semiparametric literature for panel data has emphasized models for limited de- 
pendent variables since, as for cross-section data, parametric assumptions become 
much more important when truncation, censoring, or selection are present. Attention 
focuses on models with fixed effects. We provide a brief summary. 

For binary data Manski (1987) extended his maximum score estimator from cross-section 
models to the panel model with fixed effects given in (23.33) where now the 
function F(·) is no longer specified. Although this estimator is consistent, it converges 
at a rate slower than $\sqrt{N}$ and is not asymptotically normal. 

For the Tobit model Honoré (1992) extended the censored LAD approach of Powell 
(1986a) to the panel fixed effects model (23.45) where the distribution of the error 
term $\varepsilon_{it}$ is unspecified. The data are artificially trimmed so that the fixed effect is 
subsequently eliminated by appropriate differencing. The estimator is $\sqrt{N}$-consistent 
and asymptotically normal. 

For panel data with sample selection Kyriazidou (1997) considered the fixed effects 
version of the type 2 Tobit model (23.47) where the distribution of the errors $\varepsilon_{it}$ and 
$v_{it}$ is unspecified. She presented a Heckman-type two-step estimator. A smoothed 
version of the maximum score estimator of Manski (1987) eliminates the fixed effect in 
the selection equation, although a quite complicated differencing procedure is used in 
the second stage to eliminate the fixed effect in the outcome equation. This approach 
can be generalized to other generalized Tobit models. Charlier, Melenberg, and van 
Soest (2001) provide an application to a panel version of the Roy model or type 5 
Tobit model. 

Censoring is common in duration models. Section 23.6 focused on panel models 
with completed spells. When both complete and incomplete spells are observed for 
an individual, partial likelihood methods are inappropriate, since censoring is not in- 
dependent given presence of the time-invariant fixed effect. Horowitz and Lee (2004) 
propose a consistent estimator for the MPH model (23.43) with incomplete spells that 
does not require specification of the baseline hazard. 


23.9. Practical Considerations 


As was the case for linear models, if panel data are used then at a minimum 
inference needs to be based on panel-robust standard errors. These are not provided by a 
computer program for cross-section data unless it has an option for clustered standard 
errors, in which case clustering is specified to be by the individual. 
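
To see why clustering by individual matters, the following Python sketch compares the default variance estimate of a sample mean with a cluster-robust one for a toy short panel. The data and function names are hypothetical, and no finite-sample degrees-of-freedom adjustment is applied to the cluster-robust estimate; this is a simplifying assumption of the sketch rather than a description of what any particular package does.

```python
from collections import defaultdict

def iid_variance_of_mean(y):
    """Default estimate s^2 / N, treating all observations as independent."""
    n = len(y)
    ybar = sum(y) / n
    s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)
    return s2 / n

def cluster_variance_of_mean(y, ids):
    """Cluster-robust estimate: residuals are summed within each individual
    before squaring, so within-individual correlation is retained.
    No finite-sample adjustment is applied (an assumption of this sketch)."""
    n = len(y)
    ybar = sum(y) / n
    sums = defaultdict(float)
    for v, g in zip(y, ids):
        sums[g] += v - ybar
    return sum(s ** 2 for s in sums.values()) / n ** 2

# Hypothetical panel: 4 individuals, 3 periods, perfectly persistent outcomes.
ids = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
y = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 4.0]

v_iid = iid_variance_of_mean(y)
v_cluster = cluster_variance_of_mean(y, ids)
print(v_iid, v_cluster)
```

In this example the cluster-robust variance is the larger of the two, because three perfectly correlated observations on an individual carry no more information than one.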

More efficient estimation is available using models that incorporate serial corre- 
lation. Econometricians emphasize random effects. Several packages fit models with 
normally distributed random effects, using Gaussian quadrature to integrate out the 
effect, as well as the more specialized analytically tractable random effects count data 
models. Statisticians instead emphasize the GEE approach for GLMs, available in 
many statistical packages and some econometrics packages. 

These preceding methods lead to inconsistent estimation if the random effect is 
correlated with regressors. Econometricians therefore emphasize the fixed effects ap- 
proach. Because of the incidental parameters problem, this yields consistent estimates 
in short panels for only a subset of nonlinear models. Econometrics packages are avail- 
able for conditional ML estimation of these models, the fixed effects logit and fixed 
effects count models. If a fixed effects model is infeasible then random effects models 
richer than the simplest iid random effects model might be used. 

Dynamic panel models can also be estimated. These permit distinction between 
persistence caused by unobserved heterogeneity and persistence caused by true state 
dependence. Implementation may require writing one’s own programs. 


23.10. Bibliographic Notes 


This chapter provides an overview of a vast and divergent literature and of necessity skips many 
details. The monographs on panel data by Arellano (2004), Baltagi (2001), Hsiao (2003), and 
M.-J. Lee (2002) provide considerable treatment of panel models for binary data and censored 
and selected models. Panel models for counts are presented in Cameron and Trivedi (1998) and 
M.-J. Lee (2002). Wooldridge (2002) presents panel methods for binary, censored, and count 
data. The statistical literature for various generalized linear models is summarized in Fahrmeir 
and Tutz (1994) and Diggle et al. (1994, 2002). Various papers in Mátyás and Sevestre (1995) 
consider nonlinear panel models. M.-J. Lee (2002) emphasizes GMM estimation. Arellano and 
Honoré (2001) emphasize semiparametric methods for nonlinear panel models. Bayesian 
estimation with panel data is presented in Koop (2003). 


23.2 For general discussion of the incidental parameters problem see Lancaster (2002). Key ref- 
erences are Andersen (1970) for conditional ML and Chamberlain (1992) and Wooldridge 
(1997a) for differencing methods. For random effects models Butler and Moffitt (1982) 
detail use of Gaussian quadrature to eliminate normally distributed random effects, 
whereas the statistics literature emphasizes the GEE approach of Liang and Zeger (1986). 

23.4 For fixed effects logit models key references are Chamberlain (1980) for static models, 
Chamberlain (1985) for pure time series dynamic models, and Honoré and Kyriazidou 
(2000) for dynamic models with additional regressors. See also Hsiao (1995). 

23.5 For selection in panel data see the survey by Vella (1998) and the texts by Baltagi (2001) 
and Wooldridge (2002). 

23.6 Chamberlain (1985) presents several ways to eliminate fixed effects in various duration 
models. Van den Berg (2001, section 6) provides a good discussion and many references. 
Event history analysis using multiple-spells data on individuals is more complicated than 
most panel analysis as the models are intrinsically dynamic. 

23.7 The classic reference for panel count data models is Hausman et al. (1984). For dynamic 
models see Blundell et al. (2002). 

23.8 For a survey of panel semiparametric methods see Arellano and Honoré (2001) and also 
L.-F. Lee (2001). 


Exercises 


23-1 Consider the nonlinear panel data model $y_{it} = \alpha_i + \exp(\mathbf{x}_{it}'\boldsymbol{\beta}) + u_{it}$, where $\boldsymbol{\beta}$ are 
parameters to be estimated, $\alpha_i$, $i = 1,\ldots,N$, are individual-specific effects, $u_{it}$ 
are iid $[0, \sigma_u^2]$ errors, and the panel is short. 

(a) Suppose that all $\alpha_i = 0$. Can $\boldsymbol{\beta}$ be consistently estimated? If yes, provide 
the formula or objective function for a consistent estimator. If no, give a brief 
explanation of why $\boldsymbol{\beta}$ cannot be consistently estimated. 

(b) Suppose the individual-specific effects $\alpha_i$ are random and are iid $[0, \sigma_\alpha^2]$ 
distributed independently of the regressors. Can $\boldsymbol{\beta}$ be consistently estimated? 
If yes, provide the formula or objective function for a consistent estimator. If 
no, give a brief explanation of why $\boldsymbol{\beta}$ cannot be consistently estimated. 

(c) Suppose the individual-specific effects $\alpha_i$ are random but are correlated with 
the regressors. Can $\boldsymbol{\beta}$ be consistently estimated? If yes, provide the formula 
or objective function for a consistent estimator. If no, give a brief explanation 
of why $\boldsymbol{\beta}$ cannot be consistently estimated. 


23-2 (Adapted from Chamberlain, 1980) Show that the MLE in a binary logit panel 
model is inconsistent, with plim of $2\beta$ in a simple T = 2 model. 


23-3 Use the same model for the Patents-R&D data as in Section 23.3, except vary 
the dependent variable and model as suggested in the following. In each case 
estimate random effects models and, if theoretically feasible, a fixed effects 
model. 

(a) Use a logit model of whether or not the firm has a patent. 

(b) Use a truncated Tobit model of log(Patents), with observations of 
firms with zero patents dropped. 

(c) Use a Poisson model for number of patents. 


PART SIX 


Further Topics 


In empirical work data frequently present not one but multiple complications that need 
to be dealt with simultaneously. Examples of such complications include departures 
from simple random sampling, clustering of observations, measurement errors, and 
missing data. When they occur, individually or jointly, and in the context of any of 
the models developed in Parts 4 and 5, identification of parameters of interest will 
be compromised. Three chapters in Part 6 — Chapters 24, 26, and 27 — analyze the 
consequences of such complications and then present methods that control for these 
complications. The methods are illustrated using examples taken from the earlier parts 
of the book. This feature gives points of connection between Part 6 and the rest of the 
book. 

Chapter 24, which deals with several features of data from complex surveys, notably 
stratified sampling and clustering, complements various topics covered in Chapters 3, 
5, and 16. Chapter 26 deals with measurement errors in models studied in Chapters 
4, 14, and 20. Chapter 27 is a stand-alone chapter on missing data and multiple 
imputation, but its use of the EM algorithm and Gibbs sampler also gives it points of 
contact with Chapters 10 and 13, respectively. 

Chapter 25 presents treatment evaluation. Treatment is a broad term that refers to 
the impact of one variable, e.g. schooling, on some outcome variable, e.g. earnings. 
Treatment variables may be exogenously assigned, or may be endogenously chosen. 
The topic of treatment evaluation concerns the identifiability of the impact of treat- 
ment on outcome, as measured by either the marginal effects or certain functions of 
the marginal effect. A variety of methods are used including instrumental variables 
regression and propensity score matching. The problem of treatment evaluation can 
arise in the context of any model considered in Parts 4 and 5. This chapter emphasizes 
the linear regression model, so it may be read early on. However, it does presume 
familiarity with many other topics covered in the book, including instrumental variables 
and selection models. For this reason this topic of growing importance is placed in the 
last part of the book. 


CHAPTER 24 


Stratified and Clustered Samples 


24.1. Introduction 


Microeconometrics research is usually performed on data collected by survey of a 
sample of the population of interest. The simplest statistical assumption for survey 
data is simple random sampling (SRS), under which each member of the population 
has equal probability of being included in the sample. Then it is reasonable to base 
statistical inference on the assumption that the data $(y_i, \mathbf{x}_i)$ are independent over i and 
identically distributed. This assumption underlies the small-sample and asymptotic 
properties of estimators presented in this book, with the notable exception of sample 
selection models in Chapter 16. 

In practice, however, SRS is almost never the right assumption for survey data. 
Alternative sampling schemes are instead used to reduce survey costs and to increase 
precision of estimation for subgroups of the population that are of particular interest. 

For example, a household survey may first partition the population geographically 
into subgroups, such as villages or suburbs, with differing sampling rates for different 
subgroups. Interviews may be conducted on households that are clustered in small 
geographic areas, such as city blocks. The data $(y_i, \mathbf{x}_i)$ are clearly no longer iid. First, 
the distribution of $(y_i, \mathbf{x}_i)$ may vary across subgroups, so the identical distribution 
assumption may be inappropriate. Second, since data may be correlated for households 
in the same cluster, the assumption that $(y_i, \mathbf{x}_i)$ are independent within the cluster 
breaks down. 

The usual methods employed to obtain the distribution of estimators therefore need 
to be adapted, and the properties of estimators may depart from results obtained under 
SRS. This is the subject of this chapter. 

The consequences for regression modeling are the following. First, weighted esti- 
mators that adjust for differences in sampling rates may be necessary if the goal of 
analysis is prediction of population behavior. Second, such weighting is unnecessary if 
interest lies in regression of y on x, provided the conditional model for y given x is cor- 
rectly specified and stratification is not on the dependent variable. Third, if samples 
are determined in part by the value of the dependent variable, such as an oversample 
of low-income people when income is the dependent variable, weighted estimation 
is necessary. A range of estimation procedures are possible, with some presented 
in Chapter 16 in the context of sample selection bias. Fourth, clustering at a mini- 
mum leads to standard error estimates that appreciably understate the true standard 
errors and can even lead to inconsistent parameter estimates unless adjustment is made 
for clustering using methods similar to those presented in Chapter 21 for panel data 
analysis. 

The most important implication for most microeconometrics applications using sur- 
vey data is the need to control for clustering. Clustering of observations is often found 
in both cross-section and panel data, as a consequence of (1) sampling design, (2) de- 
sign of a social experiment, or (3) the nature of the observation method. An example 
of (1) is a complex large-scale household survey in which spatial clusters of house- 
holds are sampled to reduce the cost of surveys. An example of (2) is a randomized 
social experiment in which a common treatment is assigned to individuals in a partic- 
ular location such as an industrial plant or a school. Examples of (3) are regressions 
with individual cross-section data when regressors also include group averages such 
as unemployment or tax rates at the state level, use of panel data, and use of siblings 
data even if there is no clustering of households. 

Section 24.2 introduces some of the concepts and terminology of survey sampling. 
Sections 24.3-24.5 consider, respectively, the three key features of survey data: sample 
weights, stratification, and clustering. Section 24.6 considers hierarchical linear 
models where both stratification and clustering are present. An application to data is 
presented in Section 24.7. Complex surveys are considered further in Section 24.8. 


24.2. Survey Sampling 


Survey sampling has been well researched in the statistics literature, since data collec- 
tion must be done before any analysis, and surveying can be very expensive. The goal 
of the survey literature is usually to obtain with minimal cost a sample that can pro- 
vide unbiased and reasonably precise estimates of population parameters, especially 
the population mean. 

The structure of a multistage survey was described in Section 3.2. The U.S. CPS is 
a leading example of such a sample design. 


24.2.1. Current Population Survey 


The CPS is a monthly survey of approximately 56,000 households that is intended 
to be representative of the civilian noninstitutional population 16 years and older. 
Households in smaller states are oversampled to provide more reliable state-level 
data. Within states the surveyed households are clustered to reduce interview costs. 
Specifically, households are interviewed in four consecutive months, rested for eight 
months, and then interviewed for another four months. Reinterviewing reduces sur- 
vey costs and the 4-8-4 schedule permits some longitudinal analysis, including one- 
year differences. There are eight rotation groups of similar size, with a new rotation 
group being introduced each month. We consider the sampling design for one rotation 
group. 

Specifically, there are 792 strata, where each stratum is a subregion of a state or 
in some cases an entire state. The 792 strata are split into 2,007 PSUs, where a PSU 
may be a metropolitan statistical area (MSA), a state-MSA intersection if the MSA 
covers more than one state, a single county, or two or more contiguous counties, with 
departures from this scheme when a PSU has low population or large area. On average 
there are 2.5 PSUs per stratum. Of the 792 strata, 432 contain only one PSU, in which 
case the PSU is called self-representing and is always included in the CPS. The 
other 360 strata have more than one PSU, and exactly one PSU is randomly chosen 
from each such stratum with probability proportional to the 1990 population. 

Within the PSUs there are no intermediate SSUs. The survey directly samples 
USUs, each a geographically compact group of approximately four addresses. The 
sampling probability increases if there was low probability of drawing the PSU from its 
stratum and usually increases if the PSU is in a small state, to permit oversampling 
in low-population states. (In this calculation New York and Los Angeles are treated 
as states.) All households in the USU are surveyed, unless the USU has an unusu- 
ally high number of households, in which case a subset of households is randomly 
chosen. 

The CPS is designed to be self-weighting by state so that, despite the use of nonran- 
dom sampling, the CPS should provide a representative sample for each state. How- 
ever, the unweighted sample is not nationally representative because of the oversam- 
pling of low population states and because not all PSUs are sampled. 


24.2.2. Sampling 


Before moving to a more detailed analysis of survey sampling, we provide a brief 
overview of sampling basics in the absence of complications such as stratification. 

Let z denote a vector of variables, where there is no need to distinguish between 
dependent and regressor variables. We assume that in the population the variable z is 
iid with density f(z). The population is of size N* and the sample is of size N. The 
sample is $\{z_i, i = 1,\ldots,N\}$, where i denotes the ith sampling unit. The usual notation 
in the sampling literature is n for sample size and N for population size. We instead 
continue to use N for sample size as there is only occasional need to introduce the 
population size N*. 


Exhaustive Sampling 


Under exhaustive sampling every element of the population is sampled, so the sample 
is the population. Such sampling is rare with individual-level data. It does happen 
in a population census such as the U.S. decennial census. Yet even for the census, 
subsampling is used for the lengthier questionnaires, researchers may prefer to work 
with a more manageable census subsample, and in practice the coverage of the census 
is incomplete. Exhaustive sampling is more common with firm-level data, where, for 
example, all firms in an industry may be studied. 

Exhaustive sampling can lead to debate about whether the usual inferential methods 
are warranted, as the sample moments then equal the population moments. The usual 
procedure is to still use the usual inferential methods. This is done by viewing the finite 
population to be a sample from an infinite superpopulation. 

For example, suppose interest lies in gender differences in salary at a work site that 
has a population total of 20 men and 12 women performing similar tasks. Salaries 
are obtained for all men and women at the work site, so the sample is the population, 
and mean salary is found to be higher for men than women. It is customary to perform 
conventional hypothesis tests on the differences in mean salary, rather than to conclude 
that since the sample mean equals the population mean there is 100% certainty that 
male salaries are higher. The rationale is that the population at this particular work site 
is viewed as a sample from a superpopulation of work sites, or from a superpopulation 
of the particular work site at many points in time. 

Exhaustive sampling is expensive and is generally unnecessary for large popula- 
tions unless the actual population size needs to be determined. Instead, a subset of the 
population is usually sampled. 


Simple Random Sampling 


A simple random sample is one where observations are drawn from the population 
at random and with equal probability. Each member of the population appears in the 
sample with probability equal to the sample size divided by the population size, and 
each observation has the same marginal density f(z). The prefix “simple” is added 
because more systematic sampling methods still usually have a random element. 


Finite-Sample Correction 


Most econometric analysis presumes that SRS leads to draws of z that are independent, 
so the joint density of the sample under SRS is the product of the individual densities 
$f(z_i)$. This is reasonable if the SRS is obtained from an infinite population, as is the 
case if sampling is viewed to be from a superpopulation, or if it is obtained from a 
finite population and sampling is with replacement. 

In practice for finite populations an SRS is obtained without replacement, to en- 
sure that the same observation does not appear in the sample twice. Then observations 
are no longer independent, even under SRS. To see this, note that under SRS the prob- 
ability of any particular member of the population appearing in the sample is N/N*. 
Given knowledge that this member appears in the sample, however, the probability 
of any other member appearing in the sample falls to (N — 1)/(N* — 1). Clearly, the 
conditional probability differs from the unconditional probability. More formally, one 
introduces indicator variables for whether each case in the population appears in the 
sample. These indicator variables are jointly multinomial distributed with means $\pi$, 
variances $\pi(1 - \pi)$, and covariances $-\pi(1 - \pi)/(N^* - 1)$, where $\pi = N/N^*$. 
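
These moments can be verified by brute force for a very small population. The Python sketch below uses hypothetical sizes (a population of 5 and a sample of 2) and enumerates every equally likely sample drawn without replacement; the variable names are illustrative only.

```python
from itertools import combinations

N_pop, N = 5, 2   # hypothetical population size N* and sample size N
samples = list(combinations(range(N_pop), N))   # all equally likely samples

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Indicators for whether population members 0 and 1 appear in each sample.
d0 = [1 if 0 in s else 0 for s in samples]
d1 = [1 if 1 in s else 0 for s in samples]

pi = N / N_pop
mean0 = mean(d0)
var0 = mean((d - mean0) ** 2 for d in d0)
cov01 = mean(a * b for a, b in zip(d0, d1)) - mean0 * mean(d1)
# Probability that member 1 is sampled given that member 0 is sampled.
cond_prob = mean(b for a, b in zip(d0, d1) if a == 1)
print(mean0, var0, cov01, cond_prob)
```

The enumerated mean, variance, and covariance match $\pi$, $\pi(1-\pi)$, and $-\pi(1-\pi)/(N^*-1)$, and the conditional inclusion probability equals $(N-1)/(N^*-1)$.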

The correlation between sample observations is $\rho = -1/(N^* - 1)$, where $\rho$ is 
called the intraclass correlation. Letting z be a scalar, we have that the sample 
mean $\bar{z} = N^{-1}\sum_i z_i$ has variance $V[\bar{z}] = N^{-2}V[\sum_i z_i]$, which does not simplify to 
$N^{-2}\sum_i V[z_i]$ owing to the correlation of the $z_i$. Some algebra given, for example, in 
Cochran (1977, pp. 23-24) yields 

$$V[\bar{z}] = (1 - f)\,\frac{S^2}{N},$$

where $f = N/N^*$ is the sampling fraction, and results in this literature are usually 
simpler to express in terms of $S^2 = (N^* - 1)^{-1}\sum_i (z_i - \bar{z})^2$ rather than the usual finite 
population variance $\sigma^2 = N^{*-1}\sum_i (z_i - \bar{z})^2$, with the sums and the mean $\bar{z}$ in these 
two expressions taken over the $N^*$ members of the population. 

Thus for sampling without replacement from a finite population, the variance of the 
sample mean equals the usual $S^2/N$ multiplied by the finite-sample correction term 
$1 - f$. This correction term appears in statistical packages for survey data. Failure to 
allow for the finite-sample correction term leads to conservative statistical inference 
as $V[\bar{z}]$ will be overestimated. For regression using data from SRS without replacement, 
a finite-sample correction is similarly relevant, though the extent and direction of bias 
in the variance of the OLS estimator now additionally depends on the design matrix. 
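
The correction can be confirmed exactly in a toy case. The following Python sketch, which uses a made-up population of six values and samples of size three, enumerates all samples drawn without replacement and compares the exact variance of the sample mean with the formulas with and without the correction term. All names and numbers are hypothetical.

```python
from itertools import combinations

population = [3.0, 7.0, 8.0, 12.0, 15.0, 21.0]   # hypothetical finite population
N_pop, N = len(population), 3

pop_mean = sum(population) / N_pop
S2 = sum((z - pop_mean) ** 2 for z in population) / (N_pop - 1)

# Every sample of size N drawn without replacement is equally likely under SRS.
sample_means = [sum(s) / N for s in combinations(population, N)]
exact_var = sum((m - pop_mean) ** 2 for m in sample_means) / len(sample_means)

f = N / N_pop                    # sampling fraction
with_correction = (1 - f) * S2 / N
without_correction = S2 / N
print(exact_var, with_correction, without_correction)
```

The exact variance coincides with $(1 - f)S^2/N$, whereas $S^2/N$ overstates it, which is the sense in which ignoring the correction is conservative.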

The finite-sample correction term is usually ignored in microeconometrics. This 
is often reasonable. For example, for household survey data the sample size is small 
relative to population size so that $f = N/N^* \simeq 0$. 


24.3. Weighting 


Household surveys such as the CPS are usually constructed in a way that leads to 
different households having different probabilities of inclusion in the sample. Sample 
weights are assigned to each observation to correct for this. 

As explained in the following, provided stratification is exogenous, weights should 
be used if regression is viewed as a tool to describe population responses but need not 
be used if the regression model is assumed to be the correct structural model. 


24.3.1. Sample Weights 


Suppose each household in the population has a probability $\pi_i$ of appearing in the 
sample and assume that, unlike SRS, this probability varies across households. 

Statistics such as overall sample means that give equal weight to all observations 
will then tend to give too much weight to households that appear with high probability 
in the sample. This can be corrected by weighting, using sample weights that are 
inversely proportional to the probability of inclusion in the sample: 


$$w_i \propto 1/\pi_i. \qquad (24.1)$$


For example, instead of $\bar{z} = N^{-1}\sum_i z_i$ we may use the weighted mean 

$$\bar{z}_W = \sum_i w_i z_i \Big/ \sum_i w_i .$$


Note that all that matters in (24.1) is proportionality. The weights need not sum to one, 
provided we divide by the sum of the weights. A common scaling is $\sum_i w_i = N^*$, 
in which case a weight of $w_i$ means that the observation represents $w_i$ households 
in the population. Note that care is needed in using weights. Some references 
instead define $w_i \propto \pi_i$, and some computer packages compute the weighted mean as 
$\sum_i (z_i/w_i) / \sum_i (1/w_i)$. It is easy to incorrectly weight by the reciprocal of the sample 
weights. 

For an SRS of size N from a finite population of size N*, $\pi_i = N/N^*$, so $w_i$ is 
constant and $\bar{z}_W = \bar{z}$. 

For simple stratified sampling with SRS within strata, suppose it is known that a 
fraction $H_s$ of the population size N* is in stratum s and that $N_s$ observations are from 
the sth stratum. Then $\pi_i = N_s/(H_s N^*)$. It follows that the sample weights $w_i \propto H_s/N_s$. 
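
As a small illustration of these stratified weights, the Python sketch below uses a hypothetical design with two strata, invented population shares, and invented sample values. It computes the unweighted mean, the weighted mean with weights proportional to $H_s/N_s$, and the share-weighted average of stratum sample means.

```python
# Hypothetical design: two strata with known population shares H_s.
H = {"urban": 0.8, "rural": 0.2}
sample = {                       # SRS within each stratum (made-up values)
    "urban": [10.0, 12.0, 14.0],
    "rural": [20.0, 22.0, 24.0, 26.0, 28.0, 30.0],
}

z, w = [], []
for s, values in sample.items():
    N_s = len(values)            # number of sampled observations in stratum s
    for v in values:
        z.append(v)
        w.append(H[s] / N_s)     # w_i proportional to H_s / N_s

unweighted = sum(z) / len(z)
weighted = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)
# Share-weighted average of the stratum sample means, for comparison.
by_stratum = sum(H[s] * sum(v) / len(v) for s, v in sample.items())
print(unweighted, weighted, by_stratum)
```

Because the rural stratum is oversampled here, the unweighted mean is pulled toward the rural values, while the weighted mean reproduces the share-weighted average of the stratum means.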

For two-stage sampling without stratification let $\pi_c$ be the probability that the 
cth PSU is selected and $\pi_{jc}$ be the probability that household j is selected in PSU 
c. Then the sample weights $w_{jc} \propto 1/(\pi_c N_c \pi_{jc} N)$, where $N_c$ is the number of survey 
households in the cth PSU and $N = \sum_c N_c$. A two-stage sample is self-weighting if 
the sampling probabilities at each stage are proportional to population numbers, so 
$\pi_c = N_c^*/N^*$ and $\pi_{jc} = 1/N_c^*$, where $N_c^*$ is the population size for the cth PSU. Then 
the weights $w_{jc}$ are equal as in SRS, though estimator standard errors may still have to 
be adjusted for the two-stage sampling as shown in Section 24.8. 

For the CPS, which oversamples households in small states, it would appear 
sufficient to use $w_i \propto H_s/N_s$, where s denotes state. The CPS uses this as a base weight, but 
then adjusts for subsampling within the USU if the USU has too many households. A 
further complication is that not all PSUs in a stratum are surveyed; consequently, the 
surveyed households in a stratum may not be representative of the stratum if the sampled PSUs 
differ considerably from stratum norms. This leads to two additional adjustments. First, 
adjust for unrepresentative racial (black/nonblack) composition at the stratum level. 
Second, adjust weights to ensure that sample estimates for key subgroups (formed by state, 
race, sex, and age) match independent population data. For details see U.S. Bureau of 
Census (2002). The CPS sample weights are constructed to permit the CPS to provide 
nationally representative statistics, controlling for the composition of the CPS differ- 
ing from that of the U.S. civilian population on the dimensions of state, race, sex, 
and age. 

The actual computation of sample weights for multistage surveys involves estima- 
tion procedures that can be quite complicated. The weights can be misestimated; even 
if they are correctly estimated they may take into account only some of the dimensions 
of sample nonrepresentativeness. 


24.3.2. Weighted Regression 


Should one perform weighted regression when sample weights are provided? We con- 
sider this issue in detail when the stratification is not on the dependent variable. Strat- 
ification on the dependent variable is examined in Section 24.4. 

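Consider estimation of the linear regression 

$$y_i = \mathbf{x}_i'\boldsymbol{\beta} + u_i, \qquad (24.2)$$

given survey data with sampling weights $w_i$. Two possible estimators are OLS, 

$$\hat{\boldsymbol{\beta}}_{OLS} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}, \qquad (24.3)$$

and WLS using the sampling weights, 

$$\hat{\boldsymbol{\beta}}_{WLS} = (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{y}, \qquad (24.4)$$

where $\mathbf{W} = \mathrm{Diag}[w_i]$. 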


Correctly Specified Conditional Mean 


The OLS estimator is appropriate if it is assumed that $E[u|\mathbf{x}] = 0$ so that the 
conditional mean is linear in x, 

$$E[y_i|\mathbf{x}_i] = \mathbf{x}_i'\boldsymbol{\beta}. \qquad (24.5)$$

Then OLS is consistent for $\boldsymbol{\beta}$. Furthermore, it is second-moment efficient by the 
Gauss-Markov Theorem if the errors $u_i$ are homoskedastic. The WLS estimator is 
also consistent for $\boldsymbol{\beta}$ under these assumptions but will be inefficient if errors are 
homoskedastic (since the weighting in (24.4) controls for unrepresentativeness of the sample 
rather than heteroskedasticity). 


Incorrectly Specified Conditional Mean 


In many applications (24.5) does not hold. Examples include cases with omitted 
regressors or situations when $E[y|\mathbf{x}]$ is nonlinear in x or $E[y_i|\mathbf{x}_i] = \mathbf{x}_i'\boldsymbol{\beta}_i$ where some 
components of $\boldsymbol{\beta}_i$ are correlated with $\mathbf{x}_i$. Linear regression can still be interpreted as 
the best linear prediction of y given x under squared error loss, though this needs to be 
adapted to allow for unrepresentative sampling. 

In the population, $(y_i, \mathbf{x}_i)$ are iid, and from Section 4.2 we can always write 

$$y = \mathbf{x}'\boldsymbol{\beta}^* + u,$$

where $E[u] = 0$ and $\mathrm{Cov}[\mathbf{x}, u] = \mathbf{0}$ and 

$$\boldsymbol{\beta}^* = (E[\mathbf{x}\mathbf{x}'])^{-1}E[\mathbf{x}y].$$

Note that it is no longer assumed that $E[u|\mathbf{x}] = 0$, so it is possible that $E[y|\mathbf{x}] \neq \mathbf{x}'\boldsymbol{\beta}^*$. 

The parameter $\boldsymbol{\beta}^*$ is called the census coefficient by DuMouchel and Duncan 
(1983). It is the probability limit of the regression coefficient that would be obtained 
by regression of y on x using the entire population rather than an unrepresentative 
sample. 

If the conditional mean is nonlinear in x and the sample is unrepresentative of the 
population, then the OLS estimator generally does not converge to $\boldsymbol{\beta}^*$, since with 
unrepresentative samples $N^{-1}\mathbf{X}'\mathbf{X}$ does not converge to the population moment $E[\mathbf{x}\mathbf{x}']$ 
and similarly for $N^{-1}\mathbf{X}'\mathbf{y}$. Intuitively, if the conditional mean is nonlinear in x then 
there is no reason to believe that linear regressions using quite different surveys of the 
same population will yield the same OLS estimates. 
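
However, WLS using sample weights may consistently estimate $\boldsymbol{\beta}^*$. Specifically, if 
the weighting matrix W is such that 

$$
\begin{aligned}
N^{-1}\mathbf{X}'\mathbf{W}\mathbf{X} &\xrightarrow{p} E[\mathbf{x}\mathbf{x}'], \qquad (24.6) \\
N^{-1}\mathbf{X}'\mathbf{W}\mathbf{y} &\xrightarrow{p} E[\mathbf{x}y],
\end{aligned}
$$

then $\hat{\boldsymbol{\beta}}_{WLS}$ defined in (24.4) converges to $\boldsymbol{\beta}^*$. 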
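
A deterministic toy example may help fix ideas. In the Python sketch below, the population, the sampling rates, and the helper `weighted_line_fit` are all hypothetical. The conditional mean is $y = x^2$, so a linear regression is misspecified, and the sample oversamples large values of x. The sketch compares the census coefficients with OLS and with WLS using inverse-probability weights.

```python
def weighted_line_fit(x, y, w):
    """Weighted least squares of y on (1, x); returns (intercept, slope)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def units_sampled(x):
    """Hypothetical design: 10 of 100 units per cell if x < 5, else 50 of 100."""
    return 10 if x < 5 else 50

# Hypothetical population: 100 units at each x = 0,...,9 with y = x^2.
pop_x = [float(x) for x in range(10) for _ in range(100)]
pop_y = [x ** 2 for x in pop_x]
census = weighted_line_fit(pop_x, pop_y, [1.0] * len(pop_x))

# Unrepresentative sample that oversamples cells with x >= 5.
smp_x = [float(x) for x in range(10) for _ in range(units_sampled(x))]
smp_y = [x ** 2 for x in smp_x]
weights = [100 / units_sampled(x) for x in smp_x]   # proportional to 1 / pi_i

ols = weighted_line_fit(smp_x, smp_y, [1.0] * len(smp_x))
wls = weighted_line_fit(smp_x, smp_y, weights)
print(census, ols, wls)
```

In this constructed case the weighted sample moments equal the population moments exactly, so WLS reproduces the census coefficients while OLS does not. With real survey weights the match would hold only approximately, and only if the weights satisfy (24.6).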


Simple Stratified Samples 


Much of the analysis of weighted LS estimation is presented for simple stratified 
sampling with SRS within strata. Then it is clear that (24.6) is satisfied with $w_i \propto H_s/N_s$ 
if the ith interviewed household is in the sth stratum. 

This literature also considers the possibility of different regression parameters 
within strata. It is assumed that $E[y_i|\mathbf{x}_i] = \mathbf{x}_i'\boldsymbol{\beta}_s$ for household i in stratum s. The goal 
may be to estimate the population-weighted parameter $\boldsymbol{\beta}_W = N^{*-1}\sum_s N_s^*\boldsymbol{\beta}_s$, where $N_s^*$ 
denotes the population size of stratum s. Then in 
general neither OLS nor WLS converges to $\boldsymbol{\beta}_W$, unless the $\boldsymbol{\beta}_s$ are equal across strata or 
are iid with constant mean across strata. A notable exception to this result is estimation 
of the mean of y (regression with x = 1), in which case the weighted average of the 
strata sample means is unbiased for the population mean. For details see Section 24.4.1 
and DuMouchel and Duncan (1983), Deaton (1997), or Ullah and Breunig (1998). 


Should One Use Sample Weights? 


The preceding analysis can be used to answer the question of whether to use sample 
weights in estimation, assuming there is no endogenous stratification. The discussion 
considers estimation of (possibly nonlinear) models of E[y|x], but it also applies to 
models of any other feature of the conditional distribution of y given x such as the 
median or the density. 

If one takes a structural or analytical approach and assumes that the model of 
E[y|x] is correctly specified, there is no need to use sample weights. Results can be 
used to analyze effects of changes in x on E[y|x]. 

If one instead takes a descriptive or data summary approach then weights should 
be used. Regression is then interpreted as estimating census coefficients. A major 
caveat, however, is that in complicated surveys it is not possible to compute weights 
that so clearly satisfy (24.6) as was the case for stratified sampling with SRS within 
strata. In practice sampling weights are constructed to match population proportions 
for some subgroups based on age, sex, and race. There is no guarantee that such 
weights will satisfy (24.6). 

Some data sets, such as relatively small longitudinal surveys of a few thousand 
households, are developed with a structural modeling approach in mind. Nonetheless, 
they usually attempt to provide a reasonably representative sample of the population 
while using clustered sampling to keep down survey costs. Other data sets, such as 
the CPS, are designed to provide accurate descriptive measures such as national and 
regional estimates of unemployment rates. Here designers of the survey are taking a 
census approach and indeed would prefer a monthly census if it were not so expensive
to conduct. 

For either sort of data set microeconometricians usually strive to take the structural 
modeling approach. As an example, consider regression of earnings on schooling 
level and socioeconomic characteristics such as age, sex, and race, but not measures 
of innate ability. 

Most econometricians would only give a descriptive interpretation to the coefficient 
of schooling in an OLS regression because of the endogeneity of schooling. The in- 
terpretation then is that if we hold certain key regressors constant, one more year of 
schooling is associated with, but does not necessarily cause, a 6% increase, say, in 
earnings. Here using sample weights in OLS regression is appropriate to permit esti- 
mates to be interpreted as measuring associations in the population, rather than merely 
those in a possibly unrepresentative sample. Even though no causal interpretation is 
possible, this estimate can be useful as it does measure how income varies across ed- 
ucational groups after controlling for some other key socioeconomic variables. After 
all, a major goal of statistics is data summary. 

A consistent estimate of the schooling coefficient may be obtained using more ad- 
vanced estimation methods, such as instrumental variables or panel data methods. 
Then the coefficient can be given a causal interpretation. Weighting by sample weights 
is no longer necessary, though the usual weighting to improve efficiency if, for exam- 
ple, errors are heteroskedastic, may be appropriate. 

Whether a model can be interpreted as correctly specified is a judgement call. If it
is correctly specified then the sample-weighted and unweighted estimators should have
the same probability limit, since both are consistent. This suggests testing correct
model specification by a Hausman test of the difference between the sample-weighted and
unweighted estimators, a test proposed by DuMouchel and Duncan (1983) in the case of
linear regression.


24.3.3. Prediction 


Consider nonlinear regression with correctly specified conditional mean, $g(x, \beta)$,
and no endogeneity. The unweighted NLS estimator consistently estimates $\beta$ and can be
given a causal interpretation. In particular, we can use $\partial g(x, \hat\beta)/\partial x$
to calculate the causal effect of a one-unit change in x on the conditional mean.

This predicted effect varies with the evaluation point x, since $g(\cdot)$ is nonlinear. An
estimate of the average response in the population is

$$\sum_{i} w_i \, \frac{\partial g(x_i, \hat\beta)}{\partial x_i} \Big/ \sum_{i} w_i ,$$

where $w_i$ are the sample weights. Similarly, if one instead evaluates the response at
the mean of the regressors it may be better to use the weighted sample mean of x, an
estimate of the population mean of x, rather than the unweighted sample mean of x.

Even if the parameters are consistently estimated using unweighted estimation, 
weighting must be used in subsequent impact calculations if one wishes to predict 
population impacts, rather than sample impacts. 
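
A minimal sketch of such an impact calculation follows. It assumes, purely for
illustration, an exponential conditional mean $g(x, \beta) = \exp(\beta_1 + \beta_2 x)$
with a scalar regressor; the function name, the data, and the weights are hypothetical,
and the weights are normalized inside the function so they need not sum to one.

```python
import numpy as np


def weighted_average_response(x, beta_hat, w):
    """Weighted average of dg/dx for the illustrative mean g(x, b) = exp(b0 + b1*x)."""
    slope = beta_hat[1] * np.exp(beta_hat[0] + beta_hat[1] * x)  # dg/dx at each x_i
    return np.sum(w * slope) / np.sum(w)


x = np.array([0.0, 1.0, 2.0])
beta_hat = np.array([0.0, 0.5])               # taken as already estimated (unweighted)
w_equal = np.ones(3)                          # gives the sample average response
w_survey = np.array([1.0, 1.0, 4.0])          # hypothetical sample weights

sample_effect = weighted_average_response(x, beta_hat, w_equal)
population_effect = weighted_average_response(x, beta_hat, w_survey)
```

Here the same parameter estimates give a larger predicted population impact than sample
impact, because the observation with the steepest response carries the largest weight.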


24.4. Endogenous Stratification


Stratification is widely used as it can increase precision of estimation, or equivalently
reduce survey costs for a given level of precision. For example, more precise
estimation of the mean unemployment rate in low-population states may be obtained by
oversampling those states. For similar reasons minority groups may be oversampled.

One complication, already considered in Section 24.3, is that parameters may vary 
across strata. For example, the mean unemployment rate may vary across strata. Then 
a descriptive approach is taken and weighted estimators are used. 

Microeconometricians often prefer a structural approach and assume parameters are 
unchanging across strata. Then from Section 24.3 stratification apparently causes no 
complications and unweighted regression is used. A major proviso is that problems 
still arise if stratification is based on the value of the dependent variable. For example, 
if low-income people are purposely oversampled and income is the dependent variable 
then the usual regression estimators are inconsistent. Note that there is no problem if 
stratification is on regressors such as race and this leads indirectly to oversampling of 
low-income people. Problems only arise if stratification is directly on income. 

In this section we define endogenous stratification and analyze the resulting com- 
plications. We then present several estimators that are consistent. The simplest is a 
weighted estimator that can be used if both the sample and population strata probabil- 
ities are known. The method is given in Section 24.4.5, which is self-contained. 


24.4.1. Stratification Schemes 


For general data $z \in \mathcal{Z}$ the strata are subsets of $\mathcal{Z}$. Econometric
analysis usually partitions the data into dependent variable $y \in \mathcal{Y}$, where for
generality we allow y to be a vector, and regressor or independent variable
$x \in \mathcal{X}$. The strata $C_s$, for $s = 1, \ldots, S$, are then defined to be subsets
of the sample space $\mathcal{Y} \times \mathcal{X}$. The notation is that used by
Imbens and Lancaster (1996), who present some leading examples that are reproduced
in Table 24.1.

Sampling within strata is assumed to be random but some strata may be oversam- 
pled. From Table 24.1 it is clear that the strata may sum to less than the sample space 
or more than the sample space. For the fourth and fifth schemes the stratification may 
be solely on endogenous variables, solely on exogenous variables, or on a mixture of 
the two. 

The econometrics literature has focused on sampling schemes with an endogenous 
component, since in that case the usual conditional MLE is inconsistent. 

Endogenous stratification has already been considered in Chapter 16. As an example,
consider truncated regression, where we observe y only if y > 0, so stratification
is purely on y. Then for sampled data the conditional density of y given x is a
zero-truncated density that divides the untruncated density by Pr[y > 0|x] and so

$$f^s(y|x, \theta) = \frac{f(y|x, \theta)}{1 - F(0|x, \theta)},$$

Table 24.1. Stratification Schemes with Random Sampling within Strata

Stratification Scheme | Definition | Stratum Description
Simple random | $S = 1$, $C_1 = \mathcal{Y} \times \mathcal{X}$ | One stratum covers entire sample space
Pure exogenous | $C_s = \mathcal{Y} \times \mathcal{X}_s$, with $\mathcal{X}_s \subset \mathcal{X}$ | Stratify on regressors only, not on dependent variable
Pure endogenous | $C_s = \mathcal{Y}_s \times \mathcal{X}$, with $\mathcal{Y}_s \subset \mathcal{Y}$ | Stratify on dependent variable only, not on regressors
Augmented sample | $S = 2$, $C_1 = \mathcal{Y} \times \mathcal{X}$, and $C_2 \subset \mathcal{Y} \times \mathcal{X}$ | Random sample augmented by extra observations from part of the sample space
Partitioned | $C_s \subset \mathcal{Y} \times \mathcal{X}$, $C_s \cap C_t = \emptyset$, and $\bigcup_{s=1}^{S} C_s = \mathcal{Y} \times \mathcal{X}$ | Sample space split into mutually exclusive strata that fill the entire sample space


where the superscript s is used to distinguish the sample density from the population
density $f(y|x, \theta)$. As discussed in Chapter 16, this sampling scheme tends to drop
observations with low realizations of y, given x. Suppose $E[y|x] = \beta_1 + \beta_2 x$
and $\beta_2 > 0$. Then for low values of x there will be too many relatively high values
of y. The regression will accordingly overpredict E[y|x] for low values of x, leading to
upward bias in the intercept $\beta_1$ and downward bias in the slope $\beta_2$.
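
The direction of these biases can be checked by simulation. The following sketch assumes
a hypothetical design that is not from the text: true values $\beta_1 = 0$ and
$\beta_2 = 1$, a uniform regressor on [-1, 3], and standard normal errors. OLS is run
once on the full data and once on the subsample with y > 0.

```python
import numpy as np

rng = np.random.default_rng(1)
beta1, beta2 = 0.0, 1.0                       # assumed true intercept and slope
n = 100_000
x = rng.uniform(-1.0, 3.0, n)
y = beta1 + beta2 * x + rng.normal(0.0, 1.0, n)


def ols(x, y):
    """OLS of y on an intercept and x; returns (intercept, slope)."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]


b_full = ols(x, y)                            # random sample: close to (0, 1)
keep = y > 0                                  # truncation: stratification purely on y
b_trunc = ols(x[keep], y[keep])               # intercept too high, slope too low
```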

A second example is choice-based sampling for binary or multinomial data where 
samples are chosen based on the discrete outcome y. For example, if choice is between 
travel to work by bus or travel by car we may oversample bus riders if relatively few 
people commute by bus. This example is pursued in the following. It is similar to 
case-control studies in the medical literature where, for example, a complete sample 
of people who died from a disease (y = 1) is contrasted with a similar-sized subsample 
of the universe of people who did not die of the disease (y = 0). The goal is to find 
whether one or more regressors are able to predict y = 1. 

A related example is count data on number of visits collected by on-site sampling
of users, such as sampling at recreational sites or shopping centers or doctor's offices.
Then data are truncated, since those with y = 0 are not sampled, and additionally
high-frequency visitors are oversampled. Shaw (1988) shows that the sampling
distribution of the data, $f^s(y|x, \theta)$, is related to the population distribution
through the equation

$$f^s(y|x, \theta) = f(y|x, \theta) \times \frac{y}{E[y|x, \theta]}.$$


In this case the sampling scheme is clearly endogenous though it is not a stratified 
sampling scheme. 
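
As a numerical illustration, suppose the population count is Poisson with mean `mu` (an
assumption made here only for the example) and that on-site sampling selects a person
with probability proportional to the number of visits y, so that the sampled density is
the population density times y divided by its mean. The sketch below checks that this
sampled density drops y = 0, sums to one, and coincides with the density of one plus a
Poisson count, so the on-site mean exceeds the population mean by one.

```python
import math


def poisson_pmf(y, mu):
    """Population density of a Poisson(mu) count."""
    return math.exp(-mu) * mu**y / math.factorial(y)


def onsite_pmf(y, mu):
    """Size-biased density f^s(y) = f(y) * y / E[y] for a Poisson(mu) population."""
    return poisson_pmf(y, mu) * y / mu


mu = 2.5                                      # hypothetical population mean
support = range(0, 60)                        # upper tail beyond 60 is negligible here
total = sum(onsite_pmf(y, mu) for y in support)
mean_onsite = sum(y * onsite_pmf(y, mu) for y in support)
```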


24.4.2. Endogeneity Induced by Stratification


Sampling schemes such as stratification schemes lead to the density in the sample 
differing from that in the population. If stratification is purely exogenous, then despite 
this difference the usual MLE is still consistent because the conditional density of y 
given x in the sample is the same as that in the population. However, if any aspect of 
stratification is endogenous, then these conditional densities differ, as illustrated by the 
preceding examples. We now provide a detailed discussion of this point. 

The goal of ML estimation lies in consistently estimating the parameters $\theta$ of
$f(y|x, \theta)$. In general the MLE should be based on the likelihood formed from the
joint distribution of the data (y, x). In practice it is often sufficient to simply form
a conditional likelihood from the conditional distribution of y given x. This simpler
approach can lead to consistent estimation under the assumption that x is exogenous
with respect to y, in which case the joint density factorizes as

$$g(y, x|\theta) = f(y|x, \theta) \times h(x), \tag{24.7}$$

where the parameters of the density of x are suppressed as there is no desire to estimate
these parameters.

It is always the case that we can write $g(y, x) = f(y|x) \times h(x)$. The assumption made
in (24.7) is that, upon introduction of parameters, $\theta$ appears in $f(y|x, \theta)$ but
does not appear in h(x). In general, rather than (24.7) we may have

$$g(y, x|\theta) = f(y|x, \theta) \times h(x|\theta). \tag{24.8}$$


Then one or more components of x are endogenous with respect to y since there is
now feedback: y depends on x but x in turn depends on y via the presence of $\theta$ in
$h(x|\theta)$. A classic example of this is linear simultaneous equations. In such cases ML
estimation should be based on the joint likelihood

$$\ln L_{JOINT}(\theta) = \sum_{i=1}^{N} \ln f(y_i|x_i, \theta) + \sum_{i=1}^{N} \ln h(x_i|\theta). \tag{24.9}$$

This yields a consistent estimate of $\theta$ if, from Chapter 5,

$$0 = E\left[\frac{\partial \ln g(y, x|\theta)}{\partial\theta}\right] = E\left[\frac{\partial \ln f(y|x, \theta)}{\partial\theta}\right] + E\left[\frac{\partial \ln h(x|\theta)}{\partial\theta}\right]. \tag{24.10}$$


Condition (24.10) is satisfied if the density $g(y, x|\theta)$ is correctly specified and the
range of the data does not depend on $\theta$. The conditional MLE instead maximizes the
conditional likelihood

$$\ln L_{COND}(\theta) = \sum_{i=1}^{N} \ln f(y_i|x_i, \theta).$$

The conditional MLE is consistent if $E[\partial \ln f(y|x, \theta)/\partial\theta] = 0$. This
necessary condition is implied by (24.10) if x is exogenous, since (24.10) simplifies
because then $\partial \ln h(x)/\partial\theta = 0$. If instead x is endogenous this
simplification does not occur, as the second term on the right-hand side of (24.10) does
not disappear. So the conditional MLE is inconsistent if x is endogenous.

The problem that arises with stratification and similar sampling schemes is that
even if the population joint density satisfies (24.7) and is the same across strata, the
sampling schemes can lead to joint density for (y, x) in the sample that takes the more
general form

$$g^s(y, x|\theta) = f^s(y|x, \theta) \times h^s(x|\theta), \tag{24.11}$$

where the superscript s is used to denote dependence on the particular sampling
scheme employed. Then the conditional MLE may be inconsistent, even though it
would be consistent if the sample were instead an SRS.

Under pure exogenous sampling the only difference between sample and population
distribution occurs for the marginal distribution of x. Assuming (24.7) holds in
the population, then in the sample

$$g^s(y, x|\theta) = f(y|x, \theta) \times h^s(x).$$

Clearly, the conditional MLE will be consistent as the conditional density is still
$f(y|x, \theta)$ and $\theta$ does not appear in $h^s(x)$.

Under endogenous sampling the more general result (24.11) holds in the sample
even if (24.7) holds in the population. The sample and population conditional
distributions of y given x may differ, with $f^s(y|x, \theta) \neq f(y|x, \theta)$, and
$h^s(x|\theta)$ may possibly depend on $\theta$.


24.4.3. Endogenous Sampling 


Under pure endogenous sampling the marginal distribution of y in the sample differs
from that in the population. Let h(y) denote the population density of y and $h^s(y)$
denote the sampling density of y. (We are using the convention that g, f, and h denote,
respectively, joint, conditional, and marginal distributions. It should be clear to the
reader that h(y) differs from h(x).)

The joint distribution of y and x under pure endogenous sampling is best obtained
by first conditioning on y, rather than x. Then

$$g^s(y, x) = f(x|y)\,h^s(y), \tag{24.12}$$

where simplification has occurred because the conditional distribution of x given y is
unaffected under pure endogenous sampling and so $f^s(x|y) = f(x|y)$. We now need
to reexpress f(x|y) in terms of f(y|x). Now

$$f(x|y) = \frac{g(y, x)}{h(y)} = \frac{f(y|x)\,h(x)}{h(y)}. \tag{24.13}$$

Substituting (24.13) into (24.12) and rearranging yields

$$g^s(y, x|\theta) = f(y|x, \theta) \times \frac{h^s(y)}{h(y|\theta)} \times h(x),$$

where

$$h(y|\theta) = \int g(y, x|\theta)\,dx = \int f(y|x, \theta)\,h(x)\,dx.$$

The conditional MLE using just $f(y|x, \theta)$ will be inconsistent because the term
$h(y|\theta)$ has been neglected. One instead needs to maximize a joint likelihood that
additionally includes $h(y|\theta)$.


24.4.4. Endogenously Stratified Samples 


We now consider the stratification schemes introduced in Section 24.4.1. The
population density is

$$g(y, x|\theta) = f(y|x, \theta)\,h(x).$$

There are S strata where the sth stratum is the subset $C_s$ of $\mathcal{Y} \times \mathcal{X}$.

An important distinction is made between the population probability of an observation
being in $C_s$ and the probability of sampling from $C_s$, as the two differ in a stratified
sampling scheme. We define

$$\begin{aligned}
H_s &= \Pr[\text{Draw an observation from } C_s], \\
Q_s(\theta) &= \Pr[\text{A randomly drawn observation from the population is in } C_s].
\end{aligned} \tag{24.14}$$

Here $H_s$ is set by the sample design, whereas

$$Q_s(\theta) = \int_{C_s} f(y|x, \theta)\,h(x)\,dy\,dx. \tag{24.15}$$

The strata probabilities may or may not be known. A stratum is oversampled if $H_s > Q_s$.

We begin by obtaining the joint density of s, y, and x, where s is an indicator for
the stratum from which the observation was obtained. In the population

$$g(s, y, x|\theta) = Q_s(\theta)\,g(y, x|s, \theta).$$

In the sample, the marginal distribution of the strata indicator differs from $Q_s$, and

$$g^s(s, y, x|\theta) = H_s\,g(y, x|s, \theta) = H_s\,\frac{f(y|x, \theta)\,h(x)}{Q_s(\theta)},$$

where the second equality holds as g(y, x|s) equals the density g(y, x) = f(y|x)h(x)
divided by the population probability of being in stratum s so that the density integrates
over $C_s$ to one.

It follows that the joint density is

$$g^s(s, y, x|\theta) = \frac{H_s}{Q_s(\theta)}\,f(y|x, \theta)\,h(x), \tag{24.16}$$

where $Q_s(\theta)$ is defined in (24.15). The conditional MLE based on the population
conditional density $f(y|x, \theta)$ will be inconsistent for $\theta$ since it ignores the
term $Q_s(\theta)$, which depends on $\theta$.

A variety of consistent estimators have been proposed. Here we consider maximum
likelihood estimation, GMM estimation, and a much simpler weighted estimator that
can be implemented provided both strata sampling probabilities $H_s$ and population
probabilities $Q_s(\theta)$ are known.


Maximum Likelihood Estimation 


Performing an ML estimation based on the joint density $g^s(s, y, x|\theta)$ in (24.16) is
complicated because from (24.15) $Q_s(\theta)$ depends on h(x). One
possible solution is to specify the density h(x). This approach is not taken because
econometricians shy away from specifying the distribution of regressors, even if there
is a willingness to specify the conditional distribution of the dependent variable.

Instead, a semiparametric approach is taken, with the goal of estimating the
parameters of the specified density $f(y|x, \theta)$, for an unspecified density h(x). For
simplicity assume the population strata probabilities $H_s$ are known. Cosslett (1981a)
obtained the MLE with endogenous stratification by first letting x be discrete with
$x_i$ occurring with probability $w_i$, and maximizing the joint likelihood with respect to
$\theta$ and $w_i$, $i = 1, \ldots, N$. The first-order conditions can be collapsed to yield a
concentrated likelihood that involves only $(q + S - 1)$ parameters $\theta$ and functions
$\lambda_s(\theta)$, $s = 1, \ldots, S - 1$. Second, maximizing this concentrated likelihood
with respect to $\theta$ and $\lambda_s$ yields the same estimates as maximization with
respect to $\theta$ and $\lambda_s(\theta)$. Third, since it is valid to treat $\lambda_s$ as a
parameter the same procedure can be used for the case of continuous regressors. A
problem of dimension q plus infinite-dimensional unknown density h(x) has been reduced
to $q + S - 1$ dimensions.


GMM Estimation 


The remarkable results of Cosslett (1981a) are difficult to implement.

Imbens (1992) devised a simpler GMM estimator with endogenous stratification
that has the same efficiency as Cosslett's MLE. A quite general framework and
presentation of this estimator is given by Imbens and Lancaster (1996), for stratified
samples obtained by multinomial sampling, standard stratified sampling, or variable
probability sampling. The joint density is again $g^s(s, y, x|\theta)$ in (24.16) and the
sample strata probabilities $H_s$ are permitted to be possibly unknown. The GMM analysis
is based on $S - 1$ equations for the score of $H_s$, q equations for $\theta$ based on the
conditional likelihood function of y given s and x, $S - 1$ equations for the restrictions
on the population strata probabilities $Q_s(\theta)$, and a final restriction that is not
necessary if there is a linear restriction on the $Q_s(\theta)$, which happens, for example,
if the strata are mutually exclusive and cover the sample space.


24.4.5. Weighted Estimation 


Endogenous stratification is easily dealt with when the sample and population strata
probabilities, $H_s$ and $Q_s(\theta)$ defined in (24.14), are known, though the estimator is
not fully efficient. We begin with ML estimation before considering more general
estimators.


Weighted ML Estimation


Manski and Lerman (1977) proposed the weighted maximum likelihood (WML)
estimator. This maximizes

$$Q_{WML}(\theta) = \sum_{i=1}^{N} \frac{Q_i}{H_i} \ln f(y_i|x_i, \theta), \tag{24.17}$$

where $H_i = H_s$ and $Q_i = Q_s$ if the ith observation is in stratum s.

Manski and Lerman (1977) called this estimator the weighted exogenous sampling
estimator (WESML), since (24.17) multiplies the usual term $\ln f(y_i|x_i, \theta)$ in
the conditional likelihood under exogenous sampling by the weight $Q_i/H_i$. However,
the designation WESML can lead to confusion as the problem here is one of
endogeneity; it just turns out that appropriately weighting the usual exogenous
estimator leads to consistent estimation.

Along similar lines, the objective function $Q_{WML}(\theta)$ is not formally a likelihood,
since (24.16) does not imply that the sample conditional density of y given x and s is
given by $f^s(y|x, \theta) = f(y|x, \theta)^{Q_s/H_s}$. Nonetheless, the WML estimator is
consistent.

The WML estimator solves the first-order conditions

$$\sum_{i=1}^{N} \frac{Q_i}{H_i} \frac{\partial \ln f(y_i|x_i, \theta)}{\partial\theta} = 0. \tag{24.18}$$


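This estimator is consistent if the terms in the sum have zero expected value, where
expectation is with respect to the sampling density $g^s(s, y, x|\theta)$ in (24.16). Now

$$\begin{aligned}
E\left[\frac{Q_s}{H_s} \frac{\partial \ln f(y|x, \theta)}{\partial\theta}\right]
&= \int\!\!\int \frac{Q_s}{H_s} \frac{\partial \ln f(y|x, \theta)}{\partial\theta} \frac{H_s}{Q_s} f(y|x, \theta)\,h(x)\,dy\,dx \\
&= \int\!\!\int \frac{\partial \ln f(y|x, \theta)}{\partial\theta} f(y|x, \theta)\,h(x)\,dy\,dx \\
&= \int E_{y|x}\left[\frac{\partial \ln f(y|x, \theta)}{\partial\theta}\right] h(x)\,dx \\
&= 0,
\end{aligned} \tag{24.19}$$

under the usual regularity condition that in the population the specified density
satisfies $E[\partial \ln f(y|x, \theta)/\partial\theta] = 0$. So the WML estimator is
consistent in the presence of endogenous stratification.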
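
A minimal sketch of weighted ML for a binary logit under choice-based sampling follows.
Everything specific in it is an assumption made for illustration: the logit density, the
hypothetical parameter values, two strata defined by y with equal sample shares, and a
population share that is treated as known. The function `fit_logit` is a plain Newton
iteration written for this example, not a library routine. The weights are the ratio of
population to sample strata probabilities.

```python
import numpy as np

rng = np.random.default_rng(2)


def fit_logit(X, y, w, iters=50):
    """Newton iterations maximizing sum_i w_i * ln f(y_i | x_i, theta) for a logit f."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        score = X.T @ (w * (y - p))                       # weighted score
        hess = (X.T * (w * p * (1.0 - p))) @ X            # minus the weighted Hessian
        theta = theta + np.linalg.solve(hess, score)
    return theta


# Hypothetical population in which the outcome y = 1 is rare.
theta_true = np.array([-3.0, 1.0])
N_pop = 400_000
X = np.column_stack([np.ones(N_pop), rng.normal(size=N_pop)])
y = (rng.uniform(size=N_pop) < 1.0 / (1.0 + np.exp(-X @ theta_true))).astype(float)

# Choice-based sample: strata defined by y, each supplying half of the sample.
n_s = 5_000
idx = np.concatenate([
    rng.choice(np.flatnonzero(y == 1), n_s, replace=False),
    rng.choice(np.flatnonzero(y == 0), n_s, replace=False),
])
Xs, ys = X[idx], y[idx]

Q1 = y.mean()                                 # population share of stratum y = 1 (taken as known)
H1 = 0.5                                      # sample share of stratum y = 1
w = np.where(ys == 1, Q1 / H1, (1.0 - Q1) / (1.0 - H1))

theta_unweighted = fit_logit(Xs, ys, np.ones_like(ys))
theta_wml = fit_logit(Xs, ys, w)
```

In this design the unweighted intercept is far from its true value because the rare
outcome is heavily oversampled, whereas the weighted estimates should be close to the
assumed true parameters.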

The information matrix equality does not hold for objective function $Q_{WML}(\theta)$ in
(24.17), so we need to use the sandwich form $N^{-1}A^{-1}BA^{-1}$ for the asymptotic
variance of $\hat\theta_{WML}$, where

$$A(\theta_0) = \operatorname{plim} \frac{1}{N} \sum_{i=1}^{N} \frac{Q_i}{H_i} \left. \frac{\partial^2 \ln f(y_i|x_i, \theta)}{\partial\theta\,\partial\theta'} \right|_{\theta_0} \tag{24.20}$$

and

$$B(\theta_0) = \operatorname{plim} \frac{1}{N} \sum_{i=1}^{N} \left(\frac{Q_i}{H_i}\right)^2 \left. \frac{\partial \ln f(y_i|x_i, \theta)}{\partial\theta} \frac{\partial \ln f(y_i|x_i, \theta)}{\partial\theta'} \right|_{\theta_0}. \tag{24.21}$$

This estimator is less efficient than the ML estimator of Cosslett or Imbens, but it is
relatively straightforward to implement. It does, of course, presume knowledge of the
strata probabilities.


Weighted m-Estimation 


The weighting used for the WML estimator can be applied to estimators other than
conditional ML. For example, Hausman and Wise (1979) consider similar weighted
estimation for least-squares regression.

Thus suppose with SRS we would minimize $\sum_i q(y_i|x_i, \theta)$, with first-order
conditions $\sum_i \partial q(y_i|x_i, \theta)/\partial\theta = 0$, and suppose in the
population that

$$E[\partial q(y|x, \theta_0)/\partial\theta] = 0,$$

a necessary condition for consistency. Then if sampling is instead endogenously
stratified as in Section 24.2 and the sample and population strata probabilities $H_s$
and $Q_s$ are known, $\theta$ is consistently estimated by the weighted m-estimator
$\hat\theta_W$ that minimizes

$$Q_W(\theta) = \sum_{i} \frac{Q_i}{H_i}\,q(y_i|x_i, \theta). \tag{24.22}$$

The proof of consistency follows (24.18) and (24.19) for the WML estimator and
the variance matrix is of the form $N^{-1}A^{-1}BA^{-1}$, where A and B are given in
(24.20) and (24.21) with the sole change being replacement of
$\partial \ln f(y_i|x_i, \theta)/\partial\theta$ by $\partial q(y_i|x_i, \theta)/\partial\theta$.
Wooldridge (2001) provides a formal proof.

Similarly, for estimation based on the q population moment conditions

$$E[h(y, x, \theta)] = 0,$$

under endogenous stratification, use the weighted estimating equations estimator
that solves

$$\sum_{i} \frac{Q_i}{H_i}\,h(y_i, x_i, \theta) = 0.$$

The weighted MLE results apply with $\partial \ln f(y_i|x_i, \theta)/\partial\theta$ replaced
by $h(y_i, x_i, \theta)$.

Note that the weights $Q_i/H_i$ are the same as those proposed in Section 24.3.2 for
estimation of the census parameter under simple exogenous stratified sampling. The
motivation, however, is quite different. In the current section it is assumed that
conditional moments are correctly specified so that with exogenous stratified sampling it
would be consistent and efficient to do unweighted estimation. The weights become
necessary if stratification is endogenous.
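
The same weighting can be sketched for least squares. The example assumes a hypothetical
linear population model with intercept 1 and slope 2, two strata defined by the sign of
y, a sample that draws 80 percent of its observations from the low-y stratum, and a
population stratum share that is treated as known. The helper `weighted_ls` is a name
introduced here.

```python
import numpy as np

rng = np.random.default_rng(5)
N_pop = 300_000
x = rng.normal(size=N_pop)
y = 1.0 + 2.0 * x + rng.normal(size=N_pop)    # correctly specified linear mean

# Two strata defined by y; observations with y < 0 are oversampled.
low = y < 0.0
Q_low = low.mean()                            # population stratum share (taken as known)
H_low = 0.8                                   # share of the sample drawn from that stratum
n = 20_000
n_low = int(H_low * n)
idx = np.concatenate([
    rng.choice(np.flatnonzero(low), n_low, replace=False),
    rng.choice(np.flatnonzero(~low), n - n_low, replace=False),
])
xs, ys = x[idx], y[idx]
w = np.where(ys < 0.0, Q_low / H_low, (1.0 - Q_low) / (1.0 - H_low))


def weighted_ls(x, y, w):
    """Minimize sum_i w_i * (y_i - b0 - b1*x_i)**2 via the weighted normal equations."""
    X = np.column_stack([np.ones_like(x), x])
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW @ y)


b_unweighted = weighted_ls(xs, ys, np.ones_like(ys))   # inconsistent here
b_weighted = weighted_ls(xs, ys, w)                    # close to (1, 2)
```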


24.5. Clustering 


Sections 24.3 and 24.4 on weighting and stratification covered methods to control for 
a survey design that leads to a sample distribution that differs from the population dis- 
tribution. The assumption of independence of sampled observations was maintained. 

In fact survey data are usually dependent. This may be due to use of clustered
samples to reduce survey costs, such as interviewing several households on the same block.
In such cases the data may be correlated within a cluster owing to the presence of a
common unobserved cluster-specific term. Such dependence may also arise, however,
even with SRS. For example, it may be felt that there is an unobservable effect
common to all households in the same state.

There are several different methods for controlling for dependence on unobserv- 
ables within a cluster. If the within-cluster unobservables are uncorrelated with regres- 
sors then only the variances of the regression parameters need to be adjusted. If instead 
the within-cluster unobservables are correlated with regressors then the regression pa- 
rameters are inconsistent and suitable alternative estimators are needed. The analysis 
is further complicated because methods may also vary according to whether there are 
many small clusters or few large clusters. Additional complex survey complications 
such as weighting and stratification are deferred to Section 24.6. 

The notation and models are presented next, with the key distinction being between 
random cluster effects and fixed cluster effects, similar to panel data analysis. The 
various estimators are presented in subsequent sections. 


24.5.1. Cluster-Specific Effects Models 


Interest lies in estimation of a linear regression model given data $(y_i, x_i)$,
$i = 1, \ldots, N$, where i denotes the ith sample observation, such as a household.

The concern is that some aspects of the population regression model vary by cluster
c, $c = 1, \ldots, C$. Suppose the ith household in the overall sample is the jth household
in the cth sampled cluster. A quite general model for clustered data is

$$y_{jc} = x_{jc}'\beta_c + u_{jc}, \quad j = 1, \ldots, N_c, \quad c = 1, \ldots, C, \tag{24.23}$$

where $\mathrm{Cov}[u_{jc}, u_{kc}] \neq 0$ though $\mathrm{Cov}[u_{jc}, u_{kd}] = 0$ for
$c \neq d$. This model incorporates cluster dependence through both regression parameters
that vary across clusters and errors that are correlated within a cluster.

Here we focus on a special case, the cluster-specific effects model

$$y_{jc} = x_{jc}'\beta + \alpha_c + \varepsilon_{jc}. \tag{24.24}$$

Here just the regression intercept $\alpha_c$ varies across clusters, whereas the slope
coefficients are assumed to be constant across clusters. In the simplest model
$\varepsilon_{jc}$ is assumed to be homoskedastic,

$$\varepsilon_{jc} \sim [0, \sigma_\varepsilon^2], \tag{24.25}$$

an assumption that can be relaxed to permit heteroskedasticity and correlation within
a cluster. More substantively, different assumptions on $\alpha_c$ lead to two quite
different models, which we now present.


Cluster-Specific Random Effects


In the cluster-specific random effects (CSRE) model the intercepts $\alpha_c$ in (24.24)
are purely random with distribution that does not depend on any observables. In the
simplest case it is assumed that

$$\alpha_c \sim [0, \sigma_\alpha^2]. \tag{24.26}$$


This model is directly analogous to the random effects model for panel data. The
model is just a linear regression of $y_{jc}$ on $x_{jc}$, with the complication that the
error term $\alpha_c + \varepsilon_{jc}$ is correlated for observations in the same cluster.
OLS estimation is consistent but inefficient. Importantly, the correlation of errors makes
it necessary to adjust the usual standard errors of the OLS estimator. GLS estimation is
more efficient.

Given assumptions (24.25) and (24.26) on $\varepsilon_{jc}$ and $\alpha_c$,
$V[\alpha_c + \varepsilon_{jc}] = \sigma_\alpha^2 + \sigma_\varepsilon^2$ and
$\mathrm{Cov}[\alpha_c + \varepsilon_{jc}, \alpha_c + \varepsilon_{kc}] = \sigma_\alpha^2$
for $k \neq j$. We define the intraclass correlation coefficient

$$\rho = \mathrm{Cor}[\alpha_c + \varepsilon_{jc}, \alpha_c + \varepsilon_{kc}] = \frac{\sigma_\alpha^2}{\sigma_\alpha^2 + \sigma_\varepsilon^2}. \tag{24.27}$$

There is a one-to-one correspondence between $(\sigma_\alpha^2, \sigma_\varepsilon^2)$ and
$(\sigma^2, \rho\sigma^2)$, where $\rho$ is defined in (24.27) and
$\sigma^2 = \sigma_\alpha^2 + \sigma_\varepsilon^2$. The CSRE model is equivalent to a model
with constant intraclass correlation coefficient. The model can also be given a Bayesian
interpretation, viewing each observation as having its own intercept $\alpha_c$ that is a
draw from a univariate distribution and appealing to the exchangeability criterion that
the subscript in $\alpha_{jc}$ is a purely labeling device and has no substantive
consequences. In all cases clustering has the expected effect of inducing positive
correlation between error terms within a cluster.
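
The mapping between the two variance components and the pair made up of the total error
variance and the intraclass correlation can be written out directly. The sketch below
uses hypothetical function names and illustrative values (cluster-effect variance 1 and
idiosyncratic variance 3), and also builds the within-cluster error covariance matrix
implied by the random effects assumptions.

```python
import numpy as np


def csre_error_covariance(n_c, sigma2_alpha, sigma2_eps):
    """Covariance matrix of alpha_c + eps_jc for the n_c observations in one cluster."""
    return sigma2_alpha * np.ones((n_c, n_c)) + sigma2_eps * np.eye(n_c)


def to_sigma2_rho(sigma2_alpha, sigma2_eps):
    """Map the variance components to (total variance, intraclass correlation)."""
    sigma2 = sigma2_alpha + sigma2_eps
    return sigma2, sigma2_alpha / sigma2


def from_sigma2_rho(sigma2, rho):
    """Inverse map: recover (sigma2_alpha, sigma2_eps) from (sigma2, rho)."""
    return rho * sigma2, (1.0 - rho) * sigma2


sigma2, rho = to_sigma2_rho(1.0, 3.0)         # total variance 4, correlation 0.25
omega = csre_error_covariance(3, 1.0, 3.0)    # 4 on the diagonal, 1 off the diagonal
```

Every off-diagonal element of `omega` divided by a diagonal element equals `rho`, which
is the constant intraclass correlation described above.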


Cluster-Specific Fixed Effects 


In the cluster-specific fixed effects (CSFE) model the intercepts $\alpha_c$ in (24.24)
are random unobservables, as for the CSRE model, but may possibly be correlated with
the regressors. For identification $x_{jc}$ no longer includes an intercept term.

This model is directly analogous to the fixed effects model for panel data. The model
has conditional mean $E[y_{jc}|x_{jc}, \alpha_c] = x_{jc}'\beta + \alpha_c$. The OLS
estimator from regression of $y_{jc}$ on $x_{jc}$ alone is inconsistent for $\beta$ if the
omitted variable $\alpha_c$ is correlated with $x_{jc}$. Consistent estimation of $\beta$
requires consistent estimation of $\alpha_c$, which is possible if the clusters are large.
If clusters are instead small the individual $\alpha_c$ need to be eliminated by a
differencing transformation.
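
One such differencing transformation is to subtract cluster means, which gives the within
estimator. The simulation below is a sketch under an assumed design that is not from the
text: 2,000 clusters of five observations, a scalar regressor correlated with the cluster
effect, and a true slope of one. The helper `cluster_demean` is a name introduced here.

```python
import numpy as np

rng = np.random.default_rng(3)
C, n_c, beta = 2_000, 5, 1.0                  # many small clusters (assumed design)
cluster = np.repeat(np.arange(C), n_c)
alpha = rng.normal(size=C)                    # cluster-specific effects
x = 0.8 * alpha[cluster] + rng.normal(size=C * n_c)   # regressor correlated with alpha_c
y = beta * x + alpha[cluster] + rng.normal(size=C * n_c)


def cluster_demean(v, cluster, C):
    """Subtract the cluster mean from each observation."""
    means = np.bincount(cluster, weights=v, minlength=C) / np.bincount(cluster, minlength=C)
    return v - means[cluster]


# Pooled OLS slope (with intercept) ignores alpha_c and picks up omitted variable bias.
b_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Within estimator: OLS on deviations from cluster means, which removes alpha_c.
xd, yd = cluster_demean(x, cluster, C), cluster_demean(y, cluster, C)
b_within = (xd @ yd) / (xd @ xd)
```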


Comparison to Panel Data Analysis 


The setup and terminology closely parallel those for static panel data analysis
presented in Chapters 21 to 23. At the same time there are some departures from panel
data analysis.

In the panel case the individual unit of analysis, such as the household, is observed
more than once whereas in the cluster case the individual unit of analysis is observed
only once. In the panel notation it, the first subscript is the clustering unit if the panel
is a short panel, whereas in the clustering notation jc, the second subscript is the
clustering unit. In the panel case we focused on balanced panels, but clustered data are
usually unbalanced as $N_c$ varies across clusters.

Microeconometrics methods for panel data focus on short panels. This is analogous
to having few observations per cluster and many clusters. Then $N_c$ is small and
$C \to \infty$, which we call small clusters. In addition, it is not unusual to have large
clusters, with $N_c \to \infty$ and C small. For the CSFE model with large clusters there
will only be a few parameters $\alpha_c$ to estimate and the incidental parameters problem
will not arise.

Unlike as in panel data, the appropriate clustering unit may not always be clear. For 
example, for the CPS data clustering could be viewed as arising within state, within 
strata, within PSU, or within USU. This issue is deferred to Section 24.6. The intra- 
cluster correlation is expected to decrease for clustering at more aggregate levels. If 
clustering is at the state level then the clusters are large, whereas if clustering is viewed 
as being at the level of USU then the clusters are small. Moreover, it is possible that a 
data set does not include necessary clustering information, such as the strata or USU 
for an observation. 

The analogue of dynamic, rather than static, panel data models is a model where
$y_{jc}$ depends not only on $x_{jc}$ but also on $x_{kc}$, for $k \neq j$. For clustered
data it is usually sufficient to specify a peer-effects model that more simply includes
just the cluster average $\bar{x}_c$, since the ordering of observations within a cluster
usually does not matter.


Overview 


The three common estimators for clustering are the OLS, the GLS, and the within 
estimators presented in Sections 24.5.2—24.5.4. The properties of these estimators, 
summarized in Table 24.2, vary with the true model. Most importantly, if the true 
model has cluster-specific fixed effects then OLS and RE estimators are inconsistent, 
whereas the within estimator yields consistent estimates but only for coefficients of 
regressors that vary within a cluster. Secondly, even if an estimator is consistent the 
usual standard errors will often need to be adjusted to control for clustering and possi- 
bly heteroskedasticity as detailed in the following. 


Table 24.2. Properties of Estimators for Different Clustering Models 


Section   Estimator                  Cluster Model    Consistent
24.5.2    OLS                        Random effects   Yes
                                     Fixed effects    No
24.5.3    GLS for random effects     Random effects   Yes
                                     Fixed effects    No
24.5.4    Within for fixed effects   Random effects   Yes
                                     Fixed effects    Yes


24.5.2. OLS Estimator


We consider the OLS regression

$$
y_{jc} = x_{jc}'\beta + u_{jc}. \tag{24.28}
$$


Ordinary LS is inconsistent because of omitted variables bias if the true model is the
CSFE model (i.e., $u_{jc} = \alpha_c + \varepsilon_{jc}$) with fixed effect $\alpha_c$ correlated with
$x_{jc}$. Then the OLS estimator should not be used and instead the CSFE estimators of
Section 24.5.4 should be used.

In contrast, OLS is consistent in the CSRE model, where $\alpha_c$ is a random effect
uncorrelated with $x_{jc}$. More generally, OLS is consistent under richer models for $u_{jc}$
than the CSRE model, provided $u_{jc}$ is uncorrelated with $x_{jc}$. We consider the OLS
estimator in this case, with focus on obtaining correct standard errors given correlation
of the error term $u_{jc}$ within a cluster.


Notation 
Stacking observations in (24.28) within a cluster yields

$$
y_c = X_c\beta + u_c, \tag{24.29}
$$

where $y_c$ and $u_c$ are $N_c \times 1$ vectors and $X_c$ is an $N_c \times K$ matrix. Further
stacking over clusters yields

$$
y = X\beta + u, \tag{24.30}
$$

where $y$ and $u$ are $N \times 1$ vectors and $X$ is an $N \times K$ matrix, $N = \sum_c N_c$.
The three representations of the CSRE model lead to three equivalent ways of ex- 
pressing the OLS estimator of model (24.28), 


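$$
\begin{aligned}
\hat\beta_{OLS} &= (X'X)^{-1}X'y \\
&= \Big(\sum_{c=1}^{C} X_c'X_c\Big)^{-1} \sum_{c=1}^{C} X_c'y_c \\
&= \Big(\sum_{c=1}^{C}\sum_{j=1}^{N_c} x_{jc}x_{jc}'\Big)^{-1} \sum_{c=1}^{C}\sum_{j=1}^{N_c} x_{jc}y_{jc}.
\end{aligned} \tag{24.31}
$$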


The second of these representations is especially useful given the assumption of 
independence of errors across clusters. Then, as before in the panel case, the OLS 
estimator has limit distribution 


$$
\sqrt{N}(\hat\beta_{OLS} - \beta) \xrightarrow{d} \mathcal{N}[0, A^{-1}BA^{-1}], \tag{24.32}
$$

where

$$
A = \operatorname{plim} N^{-1}\sum_{c=1}^{C} X_c'X_c, \qquad
B = \operatorname{plim} N^{-1}\sum_{c=1}^{C} X_c'u_cu_c'X_c, \tag{24.33}
$$

using independence of $u_c$ over $c$. Different assumptions on the properties of $u_c$ lead to
different estimates of $B$.


OLS Cluster-Robust Standard Errors 


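If clusters are small then there are many clusters and $B$ in (24.33) can be consistently
estimated by replacing $u_c$ by $\hat u_c = y_c - X_c\hat\beta_{OLS}$. It follows that
$\hat\beta_{OLS}$ is asymptotically normally distributed with cluster-robust variance matrix

$$
\hat V[\hat\beta_{OLS}] = \Big(\sum_{c=1}^{C} X_c'X_c\Big)^{-1}
\sum_{c=1}^{C} X_c'\hat u_c\hat u_c'X_c
\Big(\sum_{c=1}^{C} X_c'X_c\Big)^{-1}. \tag{24.34}
$$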


This formula places no restriction on heteroskedasticity and correlation within a
cluster, as $V[u_c]$ and hence $V[u_{jc}]$ and $\mathrm{Cov}[u_{jc}, u_{kc}]$ are unrestricted.
However, it does assume that $N_c$ is small and $C \to \infty$. Statistical packages often give a
degrees-of-freedom correction. Typically one multiplies the estimate in (24.34) by

$$
\frac{N-1}{N-K} \times \frac{C}{C-1},
$$

which corrects for both estimation of $\beta$ and the number of clusters in practice being
finite.
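To see how (24.34) works, treat the regressors as fixed and note that

$$
\begin{aligned}
B &= \lim N^{-1}\sum_{c=1}^{C} X_c'\,E[u_cu_c']\,X_c \\
&= \lim N^{-1}\sum_{c=1}^{C}\sum_{j=1}^{N_c}\sum_{k=1}^{N_c} E[u_{jc}u_{kc}]\,x_{jc}x_{kc}'.
\end{aligned}
$$

Then (24.34) is obtained using the estimate

$$
\begin{aligned}
\hat B &= N^{-1}\sum_{c=1}^{C} X_c'\hat u_c\hat u_c'X_c \\
&= N^{-1}\sum_{c=1}^{C}\sum_{j=1}^{N_c}\sum_{k=1}^{N_c} \hat u_{jc}\hat u_{kc}\,x_{jc}x_{kc}'.
\end{aligned}
$$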


For example, consider estimation of $E[y]$ by $\bar y$. This is the regression (24.28)
with $x_{jc} = 1$, $\hat\beta_{OLS} = \bar y$, and $\hat u_{jc} = y_{jc} - \bar y$. Then (24.34) leads to
$\hat V[\bar y] = N^{-2}\sum_c \big(\sum_j (y_{jc} - \bar y)\big)^2$, compared to the estimate
$N^{-2}\sum_c\sum_j (y_{jc} - \bar y)^2$, which additionally assumes independence within clusters.
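
As a minimal sketch of this example, the following Python code computes both variance
estimates for the sample mean. The function name `variance_of_mean` and the three small
clusters are hypothetical, chosen only so that the arithmetic is easy to follow, and no
degrees-of-freedom correction is applied.

```python
def variance_of_mean(clusters):
    """Return (cluster_robust, independence) variance estimates for the sample mean.

    `clusters` is a list of lists; each inner list holds the y values of one cluster.
    """
    all_y = [y for cluster in clusters for y in cluster]
    n = len(all_y)
    ybar = sum(all_y) / n

    # Cluster-robust: sum the residuals within each cluster first, then square,
    # which is (24.34) specialized to x = 1.
    robust = sum(sum(y - ybar for y in cluster) ** 2 for cluster in clusters) / n**2

    # Independence within clusters: square each residual separately.
    independent = sum((y - ybar) ** 2 for y in all_y) / n**2
    return robust, independent


# Hypothetical data: three clusters whose members resemble one another.
clusters = [[1.0, 2.0], [4.0, 5.0], [7.0, 8.0]]
robust, independent = variance_of_mean(clusters)
print(robust, independent)
```

For these made-up data the cluster-robust estimate is nearly twice the estimate that
assumes independence, because residuals within a cluster tend to share the same sign.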


OLS Standard Errors Assuming the CSRE Model 


The cluster-robust estimates (24.34) require many clusters. Alternative estimates that
also apply to the case of few clusters can be used if assumptions are made about the
variances and covariances of the model error $u_{jc}$. These alternative estimates also permit
analytical results regarding the impact of clustering on estimator variances.

In particular, assume that the CSRE model given by (24.24) to (24.26) is appropriate.
Then the error $u_{jc} = \alpha_c + \varepsilon_{jc}$ is independent over $c$ and within a cluster

$$
\mathrm{Cov}[u_{jc}, u_{kc}] =
\begin{cases}
\sigma_u^2, & j = k, \\
\rho\sigma_u^2, & j \neq k,
\end{cases}
$$

where the intraclass correlation coefficient $\rho$ is defined in (24.27). It follows that

$$
\Sigma_c = V[u_c] = \sigma_u^2\,[(1-\rho)I_c + \rho e_ce_c'], \tag{24.35}
$$

where $I_c$ is an $N_c \times N_c$ identity matrix and $e_c$ is an $N_c \times 1$ vector of ones.
Given $\Sigma_c$ in (24.35), the general result (24.32) to (24.33) yields

$$
V[\hat\beta_{OLS}] = \Big(\sum_{c=1}^{C} X_c'X_c\Big)^{-1}
\Big(\sum_{c=1}^{C} \sigma_u^2\,X_c'[(1-\rho)I_c + \rho e_ce_c']X_c\Big)
\Big(\sum_{c=1}^{C} X_c'X_c\Big)^{-1}. \tag{24.36}
$$

Provided the intraclass correlation coefficient is constant, this variance matrix estimator
is consistent in both the small- and large-cluster cases. Obvious estimators for $\sigma_u^2$
and $\rho$ are

$$
\hat\sigma_u^2 = \frac{1}{N-K}\sum_{c=1}^{C}\sum_{j=1}^{N_c} \hat u_{jc}^2
$$

and

$$
\hat\rho = \frac{1}{\sum_c N_c(N_c-1)}\,\frac{1}{\hat\sigma_u^2}
\sum_{c=1}^{C}\sum_{j=1}^{N_c}\sum_{k \neq j} \hat u_{jc}\hat u_{kc}.
$$

The estimate of $\rho$ involves many intracluster pairs and a consistent estimate can be
obtained using just a subset of these. As written $\sum_c N_c(N_c-1)$ pairs are used, though
in fact each unique within-cluster pair is double counted as both $\hat u_{jc}\hat u_{kc}$ and
$\hat u_{kc}\hat u_{jc}$ appear in the summations.

If the clusters are large the intracluster correlation can be permitted to vary across
clusters. Then (24.35) and (24.36) can be amended to replace $\sigma_u^2$ and $\rho$ by
$\sigma_c^2$ and $\rho_c$, respectively. These can be consistently estimated by

$$
\hat\sigma_c^2 = \frac{1}{N_c-K}\sum_{j=1}^{N_c} \hat u_{jc}^2
$$

and

$$
\hat\rho_c = \frac{1}{N_c(N_c-1)}\,\frac{1}{\hat\sigma_c^2}
\sum_{j=1}^{N_c}\sum_{k \neq j} \hat u_{jc}\hat u_{kc}.
$$


Bias of Usual OLS Standard Errors


If data are clustered, then intuitively the usual formula variance estimator for the OLS 
estimator, 


$$
\hat V[\hat\beta_{OLS}] = \hat\sigma_u^2\Big(\sum_{c=1}^{C} X_c'X_c\Big)^{-1},
$$


underestimates the true variance matrix of the OLS estimator, assuming positive 
within-cluster correlation, since each additional observation within a cluster will pro- 
vide less than one additional piece of independent information. We demonstrate this 
bias when the error process is that of the CSRE model. 

Consider the CSRE model with the same regressors within each cluster, so $x_{jc} = x_c$
and $X_c = e_cx_c'$. Then by using $e_c'e_c = N_c$, (24.36) becomes

$$
V[\hat\beta_{OLS}] = \Big(\sum_{c} N_cx_cx_c'\Big)^{-1}
\Big(\sum_{c} N_c\sigma_u^2[1 + \rho(N_c-1)]\,x_cx_c'\Big)
\Big(\sum_{c} N_cx_cx_c'\Big)^{-1},
$$

a result presented by Kloek (1981) and Moulton (1986).
Now specialize to balanced clusters, and define $M$ to be the average cluster size,
so $M = N_c = N/C$ is constant. Then the variance estimate simplifies to

$$
V[\hat\beta_{OLS}] = [1 + \rho(M-1)] \times \sigma_u^2\Big(M\sum_{c=1}^{C} x_cx_c'\Big)^{-1},
$$

whereas the formula variance simplifies to $\sigma_u^2(M\sum_c x_cx_c')^{-1}$. It follows that the
true variances are a multiple

$$
\tau = [1 + \rho(M-1)]
$$

times the usual OLS variance matrix estimate. Even if $\rho$ is small the correction factor
can be quite large. For example, if the average cluster size is $M = 101$ observations,
then the usual OLS standard errors should be multiplied by $\sqrt{1 + 100\rho}$.
The assumed independence within a cluster will also lead to a biased estimate
of $\sigma_u^2$, but this is of second-order importance. In the balanced-cluster case Kloek
shows that $E[\sum_c\sum_j \hat u_{jc}^2] = \sigma_u^2[N - K(1 + \rho(M-1))]$, so we should normalize by
$[N - K(1 + \rho(M-1))]^{-1}$ rather than $[N-K]^{-1}$.

In practice some regressors may be constant within a cluster and others may
vary. Then in the case of regression with intercept and scalar regressor (i.e.,
$x_{jc}'\beta = \beta_1 + \beta_2x_{jc}$) Scott and Holt (1982) show that the usual OLS formula
variance for the intercept should be multiplied by $1 + \rho(M-1)$ as done in the preceding,
but for the slope coefficient it should be multiplied by the smaller factor
$1 + \rho_x\rho(M-1)$, where $\rho_x$ can be viewed as an estimate of the intraclass correlation
coefficient of the $x_{jc}$. In cross-section applications $\rho_x$ is relatively small, so the
main problem lies with standard errors for cluster-invariant regressors.

Moulton (1986) demonstrated in an application that the bias in standard errors using
the incorrect OLS formula variance can be quite appreciable. He estimated a log-wage
equation using cross-section CPS data where clustering was on states. For his
application $N = 18{,}946$ and $C = 49$. For his data the estimated intraclass correlation
coefficient was $\hat\rho = 0.032$, a seemingly small value. However, the clusters are large,
and if we ignore the data being unbalanced and as a guide use the preceding formulas
with $M = 387$, the average cluster size, then $\tau = [1 + \rho(M-1)] = 13.3$. For
state-invariant regressors the true OLS standard errors are predicted to be $\sqrt{13.3} = 3.7$
times the usual reported standard errors, a very large bias. (One way to view this is
that for OLS estimation of the coefficients of state-invariant regressors, the 18,946
clustered observations have the same precision as 18,946/13.3 = 1,425 independent
observations.) For individual-varying regressors the bias will be much smaller, for
example, $[1 + \rho_x\rho(M-1)] = 2.23$ if $\rho_x = 0.10$. Moulton does not report results for the
individual-varying regressors included in the model. For the state-invariant regressors,
variables such as growth rate of employment in the state, the cluster-corrected standard
errors for OLS are generally between three and four times the incorrect formula
standard errors.
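
The arithmetic in Moulton's example can be reproduced directly. The sketch below is
illustrative only: the function name `moulton_factor` is hypothetical, and we use the
unrounded average cluster size N/C (about 386.65) rather than 387, an assumption made so
that the rounded results agree with the figures quoted above.

```python
import math


def moulton_factor(rho, m, rho_x=1.0):
    """Variance inflation 1 + rho_x * rho * (m - 1) for balanced clusters of size m.

    rho_x = 1 corresponds to a cluster-invariant regressor (or the intercept).
    """
    return 1.0 + rho_x * rho * (m - 1)


n_obs, n_clusters, rho = 18946, 49, 0.032
m = n_obs / n_clusters                        # average cluster size

tau = moulton_factor(rho, m)                  # state-invariant regressor
tau_x = moulton_factor(rho, m, rho_x=0.10)    # individual-varying regressor
n_effective = n_obs / tau                     # equivalent number of independent observations

print(round(tau, 1), round(math.sqrt(tau), 1), round(tau_x, 2), round(n_effective))
```

The effective sample size computed this way is slightly below the 1,425 quoted in the
text, which divides by the rounded factor 13.3.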

The lesson is that there can be great downward bias in the default OLS standard 
errors for the OLS coefficients of cluster-invariant regressors. For individual-varying 
regressors there is also bias, but it is much less. Cluster-invariant regressors are of- 
ten included in applications with clustered data, as it is common to model individual 
behavior as depending in part on attributes of the cluster. Valid statistical inference 
requires obtaining standard errors that control for clustering. 


24.5.3. Cluster-Specific Random Effects Estimator 


If a random effects model is appropriate then the GLS estimator is in general more 
efficient than the OLS estimator of the previous section. Given independence across 
clusters the GLS estimator of model (24.29) is 


$$
\hat\beta_{GLS} = \Big(\sum_{c=1}^{C} X_c'\Sigma_c^{-1}X_c\Big)^{-1}\sum_{c=1}^{C} X_c'\Sigma_c^{-1}y_c, \tag{24.37}
$$

where $\Sigma_c = V[u_c]$. The feasible GLS estimator replaces $\Sigma_c$ by a consistent estimate
$\hat\Sigma_c$, and assuming correct specification of the model (24.29) and error variance matrix
$\Sigma_c$, we have

$$
\hat V[\hat\beta_{GLS}] = \Big(\sum_{c=1}^{C} X_c'\hat\Sigma_c^{-1}X_c\Big)^{-1}.
$$

For the CSRE model, $\Sigma_c$ given in (24.35) can be consistently estimated by $\hat\Sigma_c$,
which replaces $\sigma_u^2$ and $\rho$ by the consistent estimates given after (24.36). As in the
similar random effects model for panel data, the feasible GLS estimator is asymptotically
equivalent to the MLE under the additional assumptions that $\alpha_c$ and $\varepsilon_{jc}$ are normally
distributed.

An attraction of the CSRE model is that the GLS estimator (24.37) can be simply
implemented by OLS estimation of the transformed regression

$$
y_{jc} - \theta_c\bar y_c = (x_{jc} - \theta_c\bar x_c)'\beta + (u_{jc} - \theta_c\bar u_c), \tag{24.38}
$$

where

$$
\theta_c = 1 - \frac{\sqrt{1-\rho}}{\sqrt{1 + \rho(N_c-1)}}
= 1 - \frac{\sigma_\varepsilon}{\sqrt{\sigma_\varepsilon^2 + N_c\sigma_\alpha^2}}. \tag{24.39}
$$

This result is proven later in this section. To implement it we replace $\theta_c$ by consistent
estimate $\hat\theta_c$. As for the panel data model, it can be shown that usual OLS standard
errors from this regression can be used if the errors $\varepsilon_{jc}$ in model (24.24) are
homoskedastic.
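
The equivalence between GLS and OLS on quasi-demeaned data can be checked numerically.
The following sketch makes several simplifying assumptions that are not in the text: a
single regressor with no intercept, a known value of $\rho$, and made-up data in
unbalanced clusters. The function names are hypothetical.

```python
import math


def gls_direct(clusters, rho):
    """GLS slope using Sigma_c^{-1} proportional to I - (rho / tau_c) e e'."""
    num = den = 0.0
    for xs, ys in clusters:
        tau_c = 1.0 + rho * (len(xs) - 1)
        sx, sy = sum(xs), sum(ys)
        num += sum(x * y for x, y in zip(xs, ys)) - (rho / tau_c) * sx * sy
        den += sum(x * x for x in xs) - (rho / tau_c) * sx * sx
    return num / den


def gls_by_transformation(clusters, rho):
    """OLS slope after subtracting theta_c times the cluster means, as in (24.38)."""
    num = den = 0.0
    for xs, ys in clusters:
        n_c = len(xs)
        theta_c = 1.0 - math.sqrt(1.0 - rho) / math.sqrt(1.0 + rho * (n_c - 1))
        xbar, ybar = sum(xs) / n_c, sum(ys) / n_c
        for x, y in zip(xs, ys):
            xt, yt = x - theta_c * xbar, y - theta_c * ybar
            num += xt * yt
            den += xt * xt
    return num / den


# Hypothetical clusters of unequal size, each given as (x values, y values).
clusters = [
    ([1.0, 2.0, 3.0], [2.1, 3.9, 6.2]),
    ([2.0, 5.0], [5.0, 9.5]),
    ([0.5, 1.5, 2.5, 4.0], [0.2, 2.4, 4.1, 7.7]),
]
b_direct = gls_direct(clusters, rho=0.3)
b_transformed = gls_by_transformation(clusters, rho=0.3)
print(b_direct, b_transformed)
```

The two slopes agree up to rounding error, and with $\rho = 0$ both reduce to pooled OLS
since then $\theta_c = 0$.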

The GLS estimator is at least as efficient as OLS assuming (24.24) to (24.26) hold. 
In the special case that all regressors are cluster-invariant there is no efficiency gain as 
GLS then coincides with OLS (Kloek, 1981). More generally, Scott and Holt (1982) 
give a quite conservative upper bound to the efficiency loss of OLS compared to 
GLS as 


VieBeus! aa (14 Cte- ay 
Vicos] Nop? 


for arbitrary vector $c$ and where $N_0 = \max\{N_c\}$ is the sample size of the largest cluster.
This bound is increasing in $N_0$ and $\rho$, and even for $N_0 = 1{,}000$ and $\rho = 0.10$, OLS is
at most 22% less efficient than GLS.

Given these small efficiency gains to GLS it is more common to focus on OLS 
estimation with correct standard errors, unless OLS is inconsistent because the CSFE 
model is appropriate. The main impact of clustering is that OLS is much less efficient 
compared to the case of no clustering, as is clear from the discussion of calculation of 
standard errors for the OLS estimator in Section 24.5.2. 

If clusters are large, then the CSRE model can be relaxed to permit the error variance
and intraclass correlation to vary across clusters. Then in (24.35) for $\Sigma_c$ we replace
$\sigma_u^2$ and $\rho$ by $\sigma_c^2$ and $\rho_c$, respectively, using consistent estimates for
$\sigma_c^2$ and $\rho_c$ given after (24.36).

If clusters are small then robust standard errors that do not constrain error correlation
to be constant within a cluster can be obtained, analogous to (24.34) for OLS.
Then

$$
\hat V[\hat\beta_{GLS}] = \Big[\sum_{c} X_c'\hat\Sigma_c^{-1}X_c\Big]^{-1}
\Big[\sum_{c} X_c'\hat\Sigma_c^{-1}\hat u_c\hat u_c'\hat\Sigma_c^{-1}X_c\Big]
\Big[\sum_{c} X_c'\hat\Sigma_c^{-1}X_c\Big]^{-1},
$$

where $\hat u_c = y_c - X_c\hat\beta_{GLS}$. This estimate requires $N_c$ small and $C \to \infty$, and
it assumes independence of errors in different clusters.


GLS Implemented as OLS in a Transformed Model


To derive (24.38), note that for $\Sigma_c$ defined in (24.35)

$$
\Sigma_c^{-1} = \big[\sigma_u^2\big((1-\rho)I_c + \rho e_ce_c'\big)\big]^{-1}
= \frac{1}{\sigma_u^2(1-\rho)}\big[I_c - (\rho/\tau_c)\,e_ce_c'\big],
$$

where $\tau_c = 1 + \rho(N_c-1)$ and hence

$$
\Sigma_c^{-1/2} = \frac{1}{\sigma_u\sqrt{1-\rho}}\big[I_c - (\theta_c/N_c)\,e_ce_c'\big],
$$

using the general results that if $e$ is an $M \times 1$ vector of ones then

$$
\begin{aligned}
[I + aee']^{-1} &= I - [a/(1+aM)]\,ee', \\
[I + aee']^{-1/2} &= I - M^{-1}\big[1 - (1+aM)^{-1/2}\big]\,ee'.
\end{aligned}
$$

Now in (24.37) $X_c'\Sigma_c^{-1}X_c = (\Sigma_c^{-1/2}X_c)'\Sigma_c^{-1/2}X_c$, where

$$
\Sigma_c^{-1/2}X_c = [I_c - (\theta_c/N_c)\,e_ce_c']X_c = X_c - \theta_c e_c\bar x_c',
$$

and where $\bar x_c = N_c^{-1}\sum_j x_{jc}$ and we ignore the scalar multiple
$1/(\sigma_u\sqrt{1-\rho})$ as it will cancel out when we similarly consider
$X_c'\Sigma_c^{-1}y_c$. The transformed regression model (24.38) follows.


24.5.4. Cluster-Specific Fixed Effects Estimator 


The basic idea of the CSFE model is straightforward: Let the cluster effect enter the
conditional mean function through the intercept term. The model is

$$
y_{jc} = \alpha_c + x_{jc}'\beta + \varepsilon_{jc}, \qquad j = 1,\dots,N_c,\quad c = 1,\dots,C, \tag{24.40}
$$

where now both $\beta$ and $\alpha_c$, $c = 1,\dots,C$, are parameters to be estimated.

In the CSFE model all cluster-invariant regressors must be dropped, as they cannot
be separately identified from $\alpha_c$. For example, if clustering is on the state and a fixed
effects model is appropriate then the effect of state-invariant regressors such as state
average unemployment cannot be identified. If estimation of the coefficients of
state-invariant regressors is desired then OLS or the CSRE estimator needs to be used instead.
However, one should first use a Hausman test analogous to that presented in Chapter 21
for panel data to confirm the validity of the strong assumption of the CSRE model that
$\alpha_c$ is uncorrelated with the regressors.

We consider statistical inference under the assumption

$$
\varepsilon_{jc} \sim [0, \sigma_{jc}^2].
$$

This permits heteroskedasticity of unknown form but assumes that inclusion of the
cluster-specific fixed effect $\alpha_c$ is sufficient to control for any error correlation within
a cluster. This is a departure from panel data analysis where concern about time-series
correlation in the errors even after inclusion of individual-specific effects leads
to richer models. If desired, however, one can additionally adjust estimator standard
errors for correlation within a cluster by methods similar to those in Section 24.5.2.

The main complication in estimation of the CSFE model is that in small clusters
there are too many intercepts $\alpha_c$ to estimate.


Cluster Dummy Variables Model 


We first consider large clusters, where the number of clusters is small relative to the
total sample size. Then the intercepts $\alpha_c$ can be estimated directly by introducing dummy
variables for each cluster and estimating by OLS.

Let observation $i$ denote the $j$th household in the $c$th cluster. Then (24.40) can be
written as the cluster dummy variables model

$$
y_i = \sum_{c=1}^{C} \alpha_c d_{ci} + x_i'\beta + \varepsilon_i, \qquad i = 1,\dots,N, \tag{24.41}
$$

where the $d_{ci}$ are indicator variables that equal one if the $i$th observation belongs to
cluster $c$ and equal zero otherwise. Thus $C$ cluster indicator variables, such as state
dummy variables, are included, and to avoid the dummy variable trap, $x_i$ should not
contain an intercept term.

An OLS estimation of this model yields consistent estimates of both $\alpha_1,\dots,\alpha_C$
and $\beta$, assuming a fixed number of clusters $C$ as $N \to \infty$. One can use the usual
Eicker-White estimate to obtain standard errors that are robust given heteroskedastic
errors.


Within-Clusters Estimator 


When there are many small clusters we can no longer estimate the model (24.40) by
OLS. First, OLS estimation may not be computationally feasible because the number
of parameters $(C + K) \to \infty$ as the number of clusters $C \to \infty$. Second, and more
importantly, because the number of parameters is going to infinity with the sample
size, the OLS estimator is inconsistent unless $N_c \to \infty$.

Interest usually lies in the parameters $\beta$ in (24.40), with $\alpha_1,\dots,\alpha_C$ viewed as
incidental parameters or as nuisance parameters. Then it is convenient to sweep out the
fixed effects by an initial data transformation. Each observation $(y_{jc}, x_{jc})$ is replaced
by deviation from the cluster mean, that is, by $(y_{jc} - \bar y_c, x_{jc} - \bar x_c)$,
$j = 1,\dots,N_c$, $c = 1,\dots,C$, where $\bar y_c = N_c^{-1}\sum_j y_{jc}$ and
$\bar x_c = N_c^{-1}\sum_j x_{jc}$ are cluster-specific averages. Then the model (24.40) for
$y_{jc}$ implies that

$$
y_{jc} - \bar y_c = (x_{jc} - \bar x_c)'\beta + (\varepsilon_{jc} - \bar\varepsilon_c). \tag{24.42}
$$

Applying OLS to the transformed regression (24.42) yields a consistent estimate
of $\beta$. If the CSFE coefficients are also of interest, they can be estimated by
$\hat\alpha_c = \bar y_c - \bar x_c'\hat\beta$, though this estimate is not consistent for small $N_c$.

Comparison with Chapter 21 shows that this is analogous to the within estimator
for panel data. As for panel data, the estimate of $\beta$ from OLS estimation of (24.42)
coincides with the estimate of $\beta$ from OLS estimation of the cluster dummy variables
model (24.41).
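
One way to see this coincidence numerically is to compute the within estimate, recover
the cluster intercepts, and verify that the resulting residuals satisfy the OLS normal
equations of the dummy variables model. The sketch below assumes a single regressor
and uses made-up data; the function names are hypothetical.

```python
def within_estimator(clusters):
    """Slope from OLS on deviations from cluster means, as in (24.42)."""
    num = den = 0.0
    for xs, ys in clusters:
        xbar, ybar = sum(xs) / len(xs), sum(ys) / len(ys)
        num += sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
        den += sum((x - xbar) ** 2 for x in xs)
    return num / den


def lsdv_normal_equations(clusters, beta, alphas):
    """Left-hand sides of the normal equations of the dummy variables model (24.41).

    Returns one entry per cluster dummy and one for the regressor; OLS requires
    that all of them equal zero.
    """
    dummy_eqs, slope_eq = [], 0.0
    for (xs, ys), alpha in zip(clusters, alphas):
        resid = [y - alpha - x * beta for x, y in zip(xs, ys)]
        dummy_eqs.append(sum(resid))                        # residuals orthogonal to dummy c
        slope_eq += sum(x * e for x, e in zip(xs, resid))   # residuals orthogonal to x
    return dummy_eqs, slope_eq


# Hypothetical clusters, each given as (x values, y values).
clusters = [
    ([1.0, 2.0, 3.0], [3.0, 5.1, 6.9]),
    ([2.0, 4.0], [1.0, 5.2]),
    ([0.0, 1.0, 2.0, 5.0], [10.0, 12.1, 13.8, 20.3]),
]
beta_within = within_estimator(clusters)
alphas = [sum(ys) / len(ys) - beta_within * sum(xs) / len(xs) for xs, ys in clusters]
dummy_eqs, slope_eq = lsdv_normal_equations(clusters, beta_within, alphas)
print(beta_within, dummy_eqs, slope_eq)
```

Because the normal equations hold at the within estimate and the implied intercepts,
these values are also the OLS solution of the dummy variables regression.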

A between estimator can also be proposed analogous to that for linear panel models.
In this case $\bar y_c$ is regressed on $\bar x_c$, $c = 1,\dots,C$. From (24.37), the GLS estimator
in the CSRE model involves regression in quasi-differences, where cluster means are
multiplied by $\theta_c$ (defined in (24.39)) before differencing. The GLS estimator can be
shown to be a linear combination of the within and between estimators. It approaches
the within estimator for large $N_c$ as then $\theta_c \to 1$. Note that the within estimator is
consistent in the CSRE model.

Caution is necessary in interpreting the standard errors if the regression is applied to 
the mean-corrected observations. The number of degrees of freedom for this regression 
is (N — K — C), not (N — K). If software neglects this adjustment then the residual 
variance from the software should be adjusted by multiplying by the inflation factor 
(N — K)/(N — K — C) and the standard errors should be inflated by the square root 
of the same. 
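
The adjustment is a one-line calculation. The sketch below uses a hypothetical function
name and made-up sample sizes purely for illustration.

```python
import math


def within_df_correction(se_reported, n_obs, n_regressors, n_clusters):
    """Rescale a standard error that used N - K degrees of freedom to use N - K - C."""
    factor = (n_obs - n_regressors) / (n_obs - n_regressors - n_clusters)
    return se_reported * math.sqrt(factor)


# Hypothetical setting: 1,000 observations, 5 regressors, 200 clusters.
se_corrected = within_df_correction(0.10, 1000, 5, 200)
print(se_corrected)
```

With many small clusters the correction is not negligible: in this made-up setting the
standard error rises by roughly 12%.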


24.5.5. Diagnostic Tests for Cluster Effects 


In linear regression a test of cluster-specific fixed effects under normality of errors is
just the standard F-test of the linear restrictions hypothesis
$H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_C = 0$ in (24.40). This simply involves a comparison
of the $R^2$ statistic for the two regressions with and without the cluster-specific dummy
variables.

In the CSRE model a test of cluster effects is a one-sided test of the null hypothesis
$H_0: \sigma_\alpha^2 = 0$ versus $H_1: \sigma_\alpha^2 > 0$. An equivalent test can also be formulated
as a test of $H_0: \rho = 0$ versus $H_1: \rho > 0$ using the definition in (24.27). The one-sided
LM test statistic of this hypothesis, given by Moulton (1987), is


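$$
LM = \frac{\sum_{c}(N_c\bar{\hat u}_c)^2 - N\hat\sigma^2}{\hat\sigma^2\sqrt{2\sum_{c} N_c(N_c-1)}}, \tag{24.43}
$$

where $\hat\sigma^2 = \sum_c\sum_j \hat u_{jc}^2/N$, $\hat u_{jc}$ denotes the least-squares residual from the
pooled regression of $y$ on $x$, and $\bar{\hat u}_c$ is the average residual for cluster $c$.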


24.5.6. Clustering in Nonlinear Models 


Nonlinear models with clustered data have not attracted much attention in the econo- 
metrics literature. There are numerous published articles in biostatistics, however, with 
a special focus on binary outcome models (Pendergast et al., 1996). Other models such 
as the Poisson regression and some models for survival data have also been considered. 
The hierarchical (multilevel) modeling framework has also been used extensively,
especially for binary outcome models.

Here we continue to exploit the parallel between clustered and panel data. As in the
linear case the data $(y_i, x_i)$, $i = 1,\dots,N$, are subscripted as $(y_{jc}, x_{jc})$,
$j = 1,\dots,N_c$, $c = 1,\dots,C$. We assume independence over $c$ but permit dependence of
observations within cluster $c$.


m-Estimation with Clustering 


Consider a nonlinear estimating equations estimator that solves

$$
\sum_{c=1}^{C}\sum_{j=1}^{N_c} h(y_{jc}, x_{jc}, \theta) = 0. \tag{24.44}
$$

Often these equations are obtained from maximization or minimization of the objective
function $\sum_c\sum_j q(y_{jc}, x_{jc}, \theta)$, in which case
$h(y_{jc}, x_{jc}, \theta) = \partial q(y_{jc}, x_{jc}, \theta)/\partial\theta$. For example, for quasi-MLE based
on the product of marginal densities
$h(y_{jc}, x_{jc}, \theta) = \partial\ln f(y_{jc}|x_{jc}, \theta)/\partial\theta$.

We assume that data are clustered, so that $\mathrm{Cov}[h_{jc}, h_{kc}] \neq 0$. However, we maintain
the assumption that $E[h(y_{jc}, x_{jc}, \theta)] = 0$, a necessary condition for consistency, which
rules out the cluster-specific fixed effects model also presented in the following.

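The cluster-robust variance of the OLS estimator (24.34) is easily adapted to the
current situation by replacing $x_{jc}x_{jc}'$ by $\partial h_{jc}/\partial\theta'$ and $x_{jc}\hat u_{jc}$ by
$h_{jc}(\hat\theta)$. Then $\hat\theta$ is asymptotically normal with cluster-robust variance matrix

$$
\hat V[\hat\theta] =
\Big(\sum_{c=1}^{C}\sum_{j=1}^{N_c} \frac{\partial h_{jc}}{\partial\theta'}\Big|_{\hat\theta}\Big)^{-1}
\sum_{c=1}^{C}\sum_{j=1}^{N_c}\sum_{k=1}^{N_c} h_{jc}(\hat\theta)h_{kc}(\hat\theta)'
\Big(\sum_{c=1}^{C}\sum_{j=1}^{N_c} \frac{\partial h_{jc}'}{\partial\theta}\Big|_{\hat\theta}\Big)^{-1}. \tag{24.45}
$$

Some computer software provides this as a standard option for many parametric
nonlinear models.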

A leading example is quasi-ML estimation based on the product of marginal densities
within a cluster rather than the joint density. Specifically, given dependence over
$j$ within cluster $c$ we should maximize the log-likelihood

$$
\ln L(\theta) = \sum_{c=1}^{C} \ln f(y_{1c},\dots,y_{N_cc} \mid x_{1c},\dots,x_{N_cc}, \theta).
$$

However, the joint density may be difficult to work with or difficult to obtain because
for many univariate densities there can be a limited range of multivariate densities.
Instead, we may maximize

$$
\begin{aligned}
Q(\theta) &= \sum_{c=1}^{C} \ln\big[f(y_{1c}|x_{1c}, \theta) \times \cdots \times f(y_{N_cc}|x_{N_cc}, \theta)\big] \\
&= \sum_{c=1}^{C}\sum_{j=1}^{N_c} \ln f(y_{jc}|x_{jc}, \theta),
\end{aligned}
$$

which is no longer a true likelihood function, unless $y_{jc}$ are independent over
$j$, so the information matrix equality no longer applies. The preceding formulas
apply with $h_{jc}(\theta) = \partial\ln f(y_{jc}|x_{jc}, \theta)/\partial\theta$ and
$\partial h_{jc}(\theta)/\partial\theta' = \partial^2\ln f(y_{jc}|x_{jc}, \theta)/\partial\theta\partial\theta'$.

This means that within each cluster we do not use the likelihood score for each
observation as in the case of independent observations; instead, we replace it by the
sum of likelihood scores over all cluster elements.


Nonlinear Cluster-Specific Random Effects 


A quite general setup for cluster-specific effects in nonlinear models is to consider the
estimator that minimizes or maximizes

$$
Q(\beta, \alpha_1,\dots,\alpha_C) = \sum_{c=1}^{C}\sum_{j=1}^{N_c} q(y_{jc}, x_{jc}, \beta, \alpha_c), \tag{24.46}
$$

where cluster effects enter only via the scalar parameter $\alpha_c$, $c = 1,\dots,C$.

A simple random effects model assumes that the $\alpha_c$ are iid with parameters $\delta$.
Taking expectation with respect to $\alpha_c$ yields the objective function

$$
Q(\beta, \delta) = \sum_{c=1}^{C} \int \Big[\sum_{j=1}^{N_c} q(y_{jc}, x_{jc}, \beta, \alpha_c)\Big] f(\alpha_c|\delta)\,d\alpha_c.
$$

Estimation can be complicated, especially if there is no closed-form expression for the
integral of the sum.

Often it is easy to obtain the expectation with respect to one observation,
$E_{\alpha_c}[q(y_{jc}, x_{jc}, \beta, \alpha_c)] = q^*(y_{jc}, x_{jc}, \beta, \delta)$. Then the simpler estimator
that ignores clustering and minimizes $Q^*(\beta, \delta) = \sum_c\sum_j q^*(y_{jc}, x_{jc}, \beta, \delta)$
will be consistent, though the standard errors need to be adjusted for clustering using (24.45).

For example, with count data we can develop a clustered analogue of the panel 
data Poisson-gamma mixture model. However, the Poisson quasi-MLE that ignores 
clustering can still be used as it is consistent, though standard errors need to be adjusted 
for clustering. 
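
A minimal sketch of this strategy follows. It assumes, for simplicity, a Poisson model
with conditional mean $\exp(x\beta)$, a single regressor, and no intercept, so that (24.45)
reduces to scalar arithmetic. The data are made up and the function name
`poisson_qmle_cluster` is hypothetical.

```python
import math


def poisson_qmle_cluster(clusters, tol=1e-12, max_iter=100):
    """Poisson quasi-MLE for scalar beta in E[y|x] = exp(x * beta).

    `clusters` is a list of (xs, ys) pairs, one pair per cluster.
    Returns (beta, cluster_robust_se, independence_se).
    """
    def scores(beta):
        # h_jc = x_jc * (y_jc - exp(x_jc * beta)), grouped by cluster.
        return [[x * (y - math.exp(x * beta)) for x, y in zip(xs, ys)]
                for xs, ys in clusters]

    def neg_hessian(beta):
        # Minus the derivative of the summed score with respect to beta.
        return sum(x * x * math.exp(x * beta) for xs, _ in clusters for x in xs)

    beta = 0.0
    for _ in range(max_iter):  # Newton-Raphson iterations
        step = sum(map(sum, scores(beta))) / neg_hessian(beta)
        beta += step
        if abs(step) < tol:
            break

    a = neg_hessian(beta)
    h = scores(beta)
    b_cluster = sum(sum(h_c) ** 2 for h_c in h)             # sum scores within cluster, then square
    b_indep = sum(h_jc ** 2 for h_c in h for h_jc in h_c)   # square each score separately
    return beta, math.sqrt(b_cluster) / a, math.sqrt(b_indep) / a


# Hypothetical count data in four clusters, each given as (x values, y values).
clusters = [
    ([0.5, 1.0, 1.5], [3, 5, 9]),
    ([0.5, 1.0, 1.5], [1, 1, 2]),
    ([0.2, 0.8, 1.2], [2, 4, 6]),
    ([0.3, 0.9, 1.4], [0, 1, 2]),
]
beta_hat, se_cluster, se_indep = poisson_qmle_cluster(clusters)
print(beta_hat, se_cluster, se_indep)
```

In these made-up data the counts are systematically high in some clusters and low in
others, so the cluster-robust standard error exceeds the one that treats observations
as independent. With only four clusters the example is purely mechanical, since the
asymptotic theory requires many clusters.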

Therefore, although random effects versions of nonlinear models can be developed, 
it is often adequate to estimate parameters by ignoring clustering and then correct the 
standard errors of estimators for the clustering. There can be little reason for estimation 
of clustered random effects models, aside from the potential for efficiency gains. 


Nonlinear Cluster-Specific Fixed Effects 


Nonlinear variants of the cluster-specific fixed effects model again maximize or minimize

$$
Q(\beta, \alpha_1,\dots,\alpha_C) = \sum_{c=1}^{C}\sum_{j=1}^{N_c} q(y_{jc}, x_{jc}, \beta, \alpha_c),
$$

as in (24.46), except now the parameters $\alpha_1,\dots,\alpha_C$ are estimated rather than
integrated out.

For large clusters, that is, $C$ small and $N_c \to \infty$, we simply optimize
$Q(\beta, \alpha_1,\dots,\alpha_C)$ with respect to $\beta$ and $\alpha_1,\dots,\alpha_C$. Assuming that
$\alpha_1,\dots,\alpha_C$ completely control for any clustering, inference can be based on standard
errors obtained under the usual iid assumptions. This is the nonlinear analogue of the
cluster-specific dummy variable model (24.41).

For small clusters, that is, $N_c$ small and $C \to \infty$, we have the problem of too many
incidental parameters $\alpha_1,\dots,\alpha_C$. Unlike the linear model it is generally not possible to
eliminate the parameters $\alpha_1,\dots,\alpha_C$ (Hall and Severini, 1998). However, from Chapter
23 on panel data we see that it is possible in some cases.

For example, the binary logit model with cluster fixed effects specifies

$$
\Pr[y_{jc} = 1] = \frac{1}{1 + \exp(-\alpha_c - x_{jc}'\beta)}, \tag{24.47}
$$

where for identification $x_{jc}$ cannot include an intercept or cluster-invariant regressors.
The fixed effects $\alpha_c$ can be eliminated using the conditional MLE that conditions on
the sum of responses within a cluster, $\sum_j y_{jc} = N_c\bar y_c$. The joint conditional
probability for the $c$th cluster is


$$
\Pr[y_{1c},\dots,y_{N_cc} \mid N_c\bar y_c] =
\frac{\exp\big(\beta'\sum_j x_{jc}y_{jc}\big)}{\sum_{d \in B_c}\exp\big(\beta'\sum_j x_{jc}d_{jc}\big)}
\times
\frac{\Gamma\big(\sum_j y_{jc} + 1\big)\,\Gamma\big(N_c - \sum_j y_{jc} + 1\big)}{\Gamma(N_c + 1)}, \tag{24.48}
$$

where $B_c = \{(d_{1c},\dots,d_{N_cc}) \mid d_{jc} = 0 \text{ or } 1, \text{ and } \sum_j d_{jc} = \sum_j y_{jc}\}$.
The conditional likelihood is the product over all clusters of terms such as these, with
clusters of size one excluded from the likelihood. The second term on the right-hand side
does not depend on the unknown parameters and hence does not affect the maximization of
the likelihood, so it can be ignored when considering maximization. The likelihood is
awkward to maximize because the set $B_c$ ranges over the many ways of choosing $N_{1c}$
outcomes $y_{jc} = 1$ from $(N_{1c} + N_{0c})$ total outcomes in cluster $c$. Fortunately, however,
a number of popular computer packages provide the conditional logit option for
estimating this model. The covariance matrix of all unknown parameters is estimated by
the inverse of the log-likelihood Hessian.

As another example, consider the Poisson fixed effects cluster model, which specifies

$$
y_{jc} \sim P[\mu_{jc} = \alpha_c\exp(x_{jc}'\beta)], \qquad c = 1,\dots,C,
$$

where $P[\cdot]$ denotes the Poisson distribution, and $x_{jc}$ excludes an intercept and any
cluster-invariant regressors. This is the usual Poisson model, except that the usual
conditional mean $\exp(x_{jc}'\beta)$ is scaled multiplicatively by the cluster-specific fixed
effect $\alpha_c$. For this particular model a variety of approaches, including conditional ML
and concentrated ML, lead to elimination of the parameters $\alpha_c$. Consistent estimates of
the parameters $\beta$ can be obtained by solving the estimating equations

$$
\sum_{c=1}^{C}\sum_{j=1}^{N_c} x_{jc}\Big(y_{jc} - \frac{\lambda_{jc}}{\bar\lambda_c}\,\bar y_c\Big) = 0,
$$

where $\lambda_{jc} = \exp(x_{jc}'\beta)$ and $\bar y_c = N_c^{-1}\sum_j y_{jc}$ and
$\bar\lambda_c = N_c^{-1}\sum_j \lambda_{jc}$ are cluster means. For further details see the discussion
of this model in the panel data case in Section 23.7.


24.5.7. Further Methods for Clustered Data 


The essential feature of clustering is that there is dependence across observations. 
A related topic is spatial correlation (see for example Anselin (2001), Lee (2004)), 
where the observational unit is a region, such as a state, and observations in regions 
close to each other are likely to be correlated. 

The random effects approach can be generalized to consider slope coefficients as 
well as the intercept. This is presented in the next section for hierarchical linear 
models. For nonlinear models the issues are similar to those for panel data presented 
in Chapter 23. 

The bootstrap can be used to obtain cluster-robust standard errors, in settings where
clustering leads to correlation within a cluster but does not affect estimator consistency.
Intuitively, one should resample with replacement over clusters $c$, in which case we
require small clusters with $C \to \infty$. At the $b$th bootstrap replication we draw $C$ clusters
with replacement and use all of the households $j$ in these $C$ resampled clusters to
estimate the $\hat\theta_b$ that solves (24.44). Then one can estimate $V[\hat\theta]$ by applying the usual
sample variance formula to $\hat\theta_1,\dots,\hat\theta_B$, where $B$ is the number of bootstrap
replications. Note that the resampling is done over clusters rather than households, since it is
clusters that are assumed to be iid whereas there is within-cluster dependence.
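
The resampling scheme can be sketched in a few lines. The code below is an illustration
under stated assumptions rather than a recommended implementation: the estimator is
simply the pooled sample mean, the six clusters are made up, and the names
`cluster_bootstrap_se` and `pooled_mean` are hypothetical.

```python
import random
import statistics


def cluster_bootstrap_se(clusters, estimator, n_reps=2000, seed=0):
    """Bootstrap standard error obtained by resampling whole clusters with replacement."""
    rng = random.Random(seed)
    n_clusters = len(clusters)
    estimates = []
    for _ in range(n_reps):
        # Draw C clusters with replacement and keep every observation in each drawn cluster.
        resampled = [clusters[rng.randrange(n_clusters)] for _ in range(n_clusters)]
        estimates.append(estimator(resampled))
    # Usual sample standard deviation of the B bootstrap estimates.
    return statistics.stdev(estimates)


def pooled_mean(clusters):
    """Sample mean over all observations in all clusters."""
    values = [y for cluster in clusters for y in cluster]
    return sum(values) / len(values)


# Hypothetical data: six clusters of five identical observations each (extreme clustering).
clusters = [[float(v)] * 5 for v in (0, 2, 4, 6, 8, 10)]
se_boot = cluster_bootstrap_se(clusters, pooled_mean)
print(se_boot)
```

Because observations within a cluster are identical here, the thirty observations carry
the information of only six independent draws, and the bootstrap standard error reflects
this. Six clusters are far too few for the asymptotic justification to apply, so the
example only illustrates the mechanics.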


24.6. Hierarchical Linear Models 


Section 24.5 restricted the role of cluster effects in the random effects model to be 
confined to the regression intercept. A more general random effects model allows 
clusterwise variation in the slope parameters also. Intercluster variation in a subset 
of regression parameters could be linked to observable cluster characteristics. Be- 
cause such models involve several layers of specification, they are called hierarchical 
models. 

A standard framework for clustered data in many applied statistics disciplines is 
that of hierarchical linear models, also called multilevel linear models, random co- 
efficients models, variance components models, and mixed linear or mixed effects 
models. This class of models brings into the specification additional information. We 
begin with a presentation of the model for individuals clustered in groups. Then the 
model is adapted to short panels where repeated measures data are clustered for each 
individual. 


24.6.1. Model Structure 


A hierarchical or multilevel model is a model that can be applied to data with a nested 
structure. Examples are data on individuals within a region, such as a state or country,
or within an organizational unit, such as a school or community, or within a family if
siblings data are used. Panel data are also an example, with repeated measures on the 
same individual interpreted as observations that are nested within an individual. 

We begin with a linear model

$$
y_{ij} = x_{ij}'\beta_j + u_{ij}, \tag{24.49}
$$

where the innovation is to let the $K$ regression parameters $\beta_j$ vary by group (or cluster)
$j$. A concrete example is to consider data on students within schools. Then $y_{ij}$ is
an outcome measure such as test score for the $i$th student in the $j$th school, and the
marginal effect of a change in a regressor such as race of the student varies across
schools. Note that the standard hierarchical linear model (HLM) notation, which we
use, reverses the subscripts compared to those in Section 24.5 where $y_{jc}$ would be the
test score for the $j$th student in the $c$th school.

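The two-level hierarchical linear model specifies the coefficients in the level-one
model (24.49) to be determined by a linear function of a random term and level-two
variables, here school characteristics. Begin with the scalar parameter $\beta_{kj}$, the $k$th
component of the $K \times 1$ vector parameter $\beta_j$. Then $\beta_{kj}$ is modeled as depending on a
vector of school characteristics $w_k$ that take value $w_{kj}$ for the $j$th school, with

$$
\beta_{kj} = w_{kj}'\gamma_k + v_{kj}, \qquad k = 1,\dots,K, \tag{24.50}
$$

where the first component of $w_{kj}$ is usually a constant. Stacking over all $K$ components
of $\beta_j$ we have

$$
\begin{bmatrix} \beta_{1j} \\ \vdots \\ \beta_{Kj} \end{bmatrix}
=
\begin{bmatrix} w_{1j}' & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & w_{Kj}' \end{bmatrix}
\begin{bmatrix} \gamma_1 \\ \vdots \\ \gamma_K \end{bmatrix}
+
\begin{bmatrix} v_{1j} \\ \vdots \\ v_{Kj} \end{bmatrix}
$$

or in obvious matrix notation

$$
\beta_j = W_j\gamma + v_j. \tag{24.51}
$$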

The model (24.50) is flexible and nests many models as special cases. These special
cases include models with random intercepts and random slopes, but the framework
additionally permits regression coefficients to vary with level-two observables $w_j$. The
range of models is very broad as the following indicates.

The $k$th level-one coefficient is called a fixed coefficient if $\beta_{kj} = \gamma_k$, in which case
the coefficient does not vary with level-two regressors or with unobservables. If all
level-one coefficients are fixed the model (24.49) reduces to $y_{ij} = \mathbf{x}_{ij}'\boldsymbol{\gamma} + u_{ij}$, in which
case estimation by OLS regression is appropriate. Note that the term fixed coefficient
has a very different meaning from the term fixed effect used by econometricians in the
panel context.

The $k$th level-one coefficient is said to be a nonrandomly varying coefficient if
$\beta_{kj} = \mathbf{w}_{kj}'\boldsymbol{\gamma}_k$. Then the coefficient is a linear function of school characteristics. If all
level-one coefficients are fixed, except that the intercept is nonrandomly varying, the
model (24.49) reduces to $y_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + \mathbf{w}_{1j}'\boldsymbol{\gamma}_1 + u_{ij}$, which is a standard OLS
regression of the outcome on individual characteristics and school characteristics.

The $k$th level-one coefficient is said to be a randomly varying coefficient if
$\beta_{kj} = \gamma_k + v_{kj}$. Then the coefficient is purely random and does not vary with school
characteristics. If all level-one coefficients are randomly varying, so that $\boldsymbol{\beta}_j = \boldsymbol{\gamma} + \mathbf{v}_j$,
the model is a variance components model or random coefficients model. If all
level-one coefficients are fixed, except that the intercept is randomly varying, then the model
(24.49) reduces further to $y_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + v_{1j} + u_{ij}$, which is a random intercept model.

In practice some of the level-one coefficients may be both nonrandomly and
randomly varying, as in the general case (24.50). If just the level-one intercept follows
the general model (24.50) whereas all other level-one coefficients are fixed, the model
(24.49) reduces to $y_{ij} = \mathbf{x}_{ij}'\boldsymbol{\beta} + \mathbf{w}_{1j}'\boldsymbol{\gamma}_1 + v_{1j} + u_{ij}$. This is the usual pooled regression
model, with an error that has two components and is therefore correlated across
individuals at the same school.

The HLM framework can be extended to additional levels. For example, individual 
students (subscript i) may be nested in schools (subscript j), which are nested in a 
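region (subscript $k$). Then the three-level HLM specifies at the first level the student
outcome $y_{ijk} = \mathbf{z}_{ijk}'\boldsymbol{\pi}_{jk} + e_{ijk}$, where the parameters $\boldsymbol{\pi}_{jk} = \mathbf{X}_{jk}\boldsymbol{\beta}_k + \mathbf{u}_{jk}$, and in turn
$\boldsymbol{\beta}_k = \mathbf{W}_k\boldsymbol{\gamma} + \mathbf{v}_k$.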

The HLM can be reexpressed as a mixed linear model, since substituting (24.50) 
into (24.49) yields 


$$y_{ij} = (\mathbf{x}_{ij}'\mathbf{W}_j)\boldsymbol{\gamma} + \mathbf{x}_{ij}'\mathbf{v}_j + u_{ij}. \tag{24.52}$$


The goal is to estimate the regression parameter $\boldsymbol{\gamma}$ and the variances and covariances
of the errors $u_{ij}$ and $\mathbf{v}_j$. Since the errors are assumed to be independent of
regressors, pooled OLS estimation of (24.52) yields consistent parameter estimates of $\boldsymbol{\gamma}$. The
HLM approach uses more efficient estimators that exploit assumptions on the
variances and covariances of the errors $u_{ij}$ and $\mathbf{v}_j$.
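
The consistency of pooled OLS in the mixed model can be illustrated with a small simulation. The sketch below is not from the text: it assumes a random intercept model with one student-level regressor, hypothetical values for the fixed coefficients (`gamma0`, `gamma1`), and scalar standard deviations `tau` and `sigma` for the school effect $v_{1j}$ and the student error $u_{ij}$. It uses only the Python standard library.

```python
import random

random.seed(0)

# Assumed design: J schools with n students each (illustrative sizes only).
J, n = 200, 20
gamma0, gamma1 = 1.0, 0.5   # hypothetical fixed coefficients
tau, sigma = 1.0, 1.0       # assumed sd of v_1j (school) and u_ij (student)

x, y = [], []
for j in range(J):
    v1j = random.gauss(0.0, tau)          # random intercept shared within school j
    for i in range(n):
        xij = random.gauss(0.0, 1.0)      # student-level regressor
        uij = random.gauss(0.0, sigma)    # idiosyncratic error
        x.append(xij)
        y.append(gamma0 + gamma1 * xij + v1j + uij)

# Pooled OLS of y on a constant and x, ignoring the school structure.
N = len(y)
xbar, ybar = sum(x) / N, sum(y) / N
sxx = sum((a - xbar) ** 2 for a in x)
slope = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sxx
intercept = ybar - slope * xbar

# Implied correlation of the composite error v_1j + u_ij within a school.
rho = tau ** 2 / (tau ** 2 + sigma ** 2)

print("pooled OLS slope:", round(slope, 3), "(true value", gamma1, ")")
print("within-school error correlation:", rho)
```

The slope estimate is close to the assumed value even though the errors are correlated within schools; what pooled OLS gets wrong in this setting is the default standard error, not the point estimate.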

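In the simplest case $u_{ij}$ are assumed to be iid $\mathcal{N}[0, \sigma^2]$ and $\mathbf{v}_j$ is assumed to be iid
$\mathcal{N}[\mathbf{0}, \mathbf{T}]$. Then the model can be represented as


$$y_{ij} \sim \mathcal{N}[\mathbf{x}_{ij}'\boldsymbol{\beta}_j, \sigma^2], \qquad \boldsymbol{\beta}_j \sim \mathcal{N}[\mathbf{W}_j\boldsymbol{\gamma}, \mathbf{T}].$$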


An early treatment of this was provided in a Bayesian setting by Lindley and Smith 
(1972), in which $\boldsymbol{\gamma}$ are called hyperparameters, which in more general models can
themselves depend in turn on higher-level hyperparameters. The parameters $\boldsymbol{\gamma}$, $\sigma^2$,
and $\mathbf{T}$ can be estimated by maximum likelihood methods or by Bayes methods.
Alternatively, ML methods can be used that are essentially the same as those for the mixed
linear panel data model presented in Section 21.7. A complete treatment is given in 
Bryk and Raudenbush (1992, 2002). 


24.6.2. HLM for Panel Data 


The HLM literature interprets a short panel as repeated measures for an individual. 
Then the individual becomes level two in the two-level HLM, whereas the individual 
was level one in the preceding section. The model (24.28) becomes

$$y_{ti} = \mathbf{x}_{ti}'\boldsymbol{\beta}_i + u_{ti}, \tag{24.53}$$


where, for example, $y_{ti}$ denotes an outcome measure at time $t$ for student $i$, and the
marginal effect of changes in regressors such as specific subjects studied varies across
students. The scalar parameter $\beta_{ki}$, the $k$th element of the $K \times 1$ vector parameter $\boldsymbol{\beta}_i$,
is modeled as depending on a vector of individual characteristics $\mathbf{w}_k$ that takes value
$\mathbf{w}_{ki}$ for the $i$th individual, with


$$\beta_{ki} = \mathbf{w}_{ki}'\boldsymbol{\gamma}_k + v_{ki}, \qquad i = 1, \ldots, N. \tag{24.54}$$


The individual-specific effects model is the special case that all level-one
coefficients are fixed, so $\beta_{ki} = \gamma_k$, except that the intercept term $\beta_{1i}$ can vary across
individuals (the level-two grouping).

The individual-specific fixed effects model arises if there is no model for the
intercept $\beta_{1i}$, but instead $\beta_{1i}$ is directly estimated. This is an extreme case of a nonrandomly
varying coefficient, with $\beta_{1i} = \mathbf{w}_{1i}'\boldsymbol{\gamma}_1$, where $\mathbf{w}_{1i}$ is an $N \times 1$ vector of indicator
variables with $l$th component equal to one if $i = l$ and equal to zero otherwise so that
$\beta_{1i} = \gamma_{1i}$. The HLM framework is not designed to accommodate what
econometricians call the fixed effects model.

The individual-specific random effects model arises if the intercept $\beta_{1i}$ is a
randomly varying coefficient, so that $\beta_{1i} = \gamma_1 + v_{1i}$. Clearly, much more general random
effects models can be specified with $\beta_{ki}$ also depending on regressors $\mathbf{w}_{ki}$.

As already noted, the HLM is a mixed linear model. For the panel data case the 
analogue of (24.52) is 


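$$y_{ti} = (\mathbf{x}_{ti}'\mathbf{W}_i)\boldsymbol{\gamma} + \mathbf{x}_{ti}'\mathbf{v}_i + u_{ti}.$$


The random effects model of Chapter 21 is the specialization to $y_{ti} = \mathbf{x}_{ti}'\boldsymbol{\gamma} + v_{1i} + u_{ti}$.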

A standard panel application of the HLM framework is to growth models, where 
the outcome $y_{ti}$ is individual intelligence or height, which is a function of age, and the 
marginal effect of age is permitted to vary across individuals. Here the slope coefficient 
in addition to the intercept is permitted to vary across individuals. 


24.7. Clustering Example: Vietnam Health Care Use 


In this section we focus on estimation in the presence of clustering, since this is the 
most common complication of survey data that appears in microeconometrics research. 
The methods in Section 24.5 are implemented. 

Both linear and nonlinear regression models are estimated based on individual- and 
household-level data from the World Bank’s Vietnam Living Standards Survey (VLSS) 
of 1997-1998. The survey collected detailed information on a variety of topics from 
over 27,700 individuals in approximately 6,000 households distributed over approxi- 
mately 194 communes. In what follows “commune” is treated as a cluster or a group 
and it is hypothesized that the observed outcomes are correlated within a commune. 
Average cluster size in the household sample is about 26, maximum cluster size is 39, 
and minimum cluster size is 1. To illustrate linear and nonlinear cluster models three 
outcomes will be modeled. 

First, we consider a (log)linear regression model of total annual household health 
care expenditure (LNEXP12M), for households with positive expenditure, as a func- 
tion of the (log of) total household expenditure (HHEXP), controlling for several stan- 
dard sociodemographic variables, a type of “Engel curve” for health care expenditure. 
Of interest is the coefficient of total household expenditure, which is an estimate of the 
household income elasticity of demand for health care. 

Second, we use information on individual responses to estimate clustered count 
models for a type of health care that accounts for a high proportion of aggregate private 
health care expenditure. In modeling these outcomes we control for recent health status 
of an individual, household income, health insurance status, and various demographic 
variables such as age, sex, marital status, and educational attainment of the head of 
the household. Information about health status was restricted to ILLNESS or INJURY 
sustained in the survey period, the duration of illness, and number of days of restricted 
activity. The key coefficients of interest are again the coefficients on the income and 
insurance status variables. 

Table 24.3 provides the definitions and summary statistics for variables used in 
these examples. 

In both cases the key issues are the following: What is the impact of clustering on 
the estimate of this elasticity? How do the elasticity and its impact vary as different 
statistical assumptions, models, and estimators are used? 


24.7.1. Results and Discussion 


Table 24.4 gives the results for the OLS regression, HC t-ratios, fixed effects, and
random effects formulations. There is a relatively minor change in standard errors
resulting from the use of a heteroskedasticity-consistent variance estimator that does not take
account of the clusters. However, when the cluster-robust variance estimator (24.34) is
used there is a substantial change in the standard errors. The t-ratio for the expenditure
elasticity drops from 16.01 to 12.68. All t-ratios become smaller and those for the two
variables SEX and HHSIZE fall below 1.96. These results suggest, as expected, that 
ignoring intracluster correlation causes inflation in the OLS t-ratios. 
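
The inflation of default t-ratios under intracluster correlation can be reproduced on simulated data. The following is a minimal sketch, not the VLSS analysis: it assumes a single regressor that is constant within each cluster, a cluster error component with unit variance, and hypothetical coefficient values. The cluster-robust standard error is the basic sandwich form with no finite-sample correction, which may differ in detail from (24.34).

```python
import math
import random

random.seed(1)

# Assumed design: C clusters of m observations, regressor fixed within cluster.
C, m = 100, 20
x, y, cluster = [], [], []
for c in range(C):
    xc = random.gauss(0.0, 1.0)       # cluster-level regressor
    ac = random.gauss(0.0, 1.0)       # cluster-specific error component
    for j in range(m):
        x.append(xc)
        cluster.append(c)
        y.append(1.0 + 0.5 * xc + ac + random.gauss(0.0, 1.0))

# OLS of y on a constant and x.
N = len(y)
xbar, ybar = sum(x) / N, sum(y) / N
sxx = sum((xi - xbar) ** 2 for xi in x)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
a = ybar - b * xbar
resid = [yi - a - b * xi for xi, yi in zip(x, y)]

# Default (iid errors) standard error of the slope.
s2 = sum(e * e for e in resid) / (N - 2)
se_default = math.sqrt(s2 / sxx)

# Heteroskedasticity-robust standard error that ignores clusters.
se_het = math.sqrt(sum(((xi - xbar) * e) ** 2 for xi, e in zip(x, resid))) / sxx

# Cluster-robust standard error: sum the scores within each cluster first.
score = [0.0] * C
for xi, e, c in zip(x, resid, cluster):
    score[c] += (xi - xbar) * e
se_cluster = math.sqrt(sum(s * s for s in score)) / sxx

for name, se in [("default", se_default), ("het-robust", se_het), ("cluster", se_cluster)]:
    print(f"{name:10s} se = {se:.4f}   |t| = {abs(b) / se:.2f}")
```

In this design the heteroskedasticity-robust and default standard errors are similar, while the cluster-robust standard error is much larger, the same qualitative pattern described for Table 24.4.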

The F-test of the null hypothesis that all fixed effects are equal rejects the null. 
The fixed effects results have essentially the same pattern but note that the t-ratios are 
even smaller. The point estimate of the income elasticity is now 0.60 compared with 
0.67 in the OLS results. However, overall there is no significant shift in the inference 
about the role of different variables. 

A $\chi^2(1)$ score test of the null hypothesis that the random variation in the intercept 
is zero, based on (24.43), easily rejects the null, indicating that the RE model is an 
improvement over the restricted regression. However, the estimated RE model also 
does not result in a significant change in the assessment of the role of different vari- 
ables. As expected the results presented under the FGLS columns and the RE (GLS) 
columns are very similar. The minor differences are essentially due to the different 
values used in the GLS transformation. The FGLS estimates are based on $\hat{\rho} = 0.12$,
an estimate obtained by averaging 100 estimates of $\rho$ obtained using 100 resampled
pairs of least-squares residuals.


Table 24.3. Vietnam Health Care Use: Data Description

Variable         Definition                                              Mean     Standard
                                                                                  Deviation

Household data
LNEXP12M         Total household health care expenditure                 6.31       1.59
                 for 12 months
AGE              Age of head of household                               48.01      13.77
SEX              Equals 1 if the head of the household is                0.27       0.44
                 female, 0 otherwise
HHSIZE           Total household size                                    4.73       1.96
URBAN            Equals 1 if urban household, zero otherwise             0.29       0.45
EDUC             Schooling year of household head                        7.09       4.41
HHEXP            Total nominal household expenditure                    15273      13020
                 (1998 VN dong)

Individual data
PHARVIS          Number of direct pharmacy visits                        0.51       1.31
LNMEDEXP (> 0)   log (total medical expenditure) for those with          2.14       1.08
                 positive expenditure (1998 VN dong)
AGE              Age in years                                            29.7       9.67
SEX              Equals 1 if respondent is male                          0.51       0.49
MARRIED          Equals 1 for married person                             0.40       0.49
EDUC             Completed diploma level                                 3.38       1.94
ILLNESS          Number of illnesses experienced in                      0.62       0.90
                 past 12 months
INJURY           Equals 1 if injured during survey period                0.62       0.90
ILLDAYS          Number of illness days                                  2.80       5.45
ACTDAYS          Number of days of limited activity                      0.06       1.11
INSURANCE        Equals 1 if respondent has health insurance             0.16       0.37
                 coverage
MEDEXP (> 0)     Medical expenditure conditional on positive            21.04        208
                 expenditure
MEDEXP           Medical expenditure (1998 VN dong)                      6.13     112.75

The absolute differences between FE and RE results are relatively small. Informal 
comparison does not suggest that the FE and RE formulations yield substantially dif- 
ferent results; however, the Hausman test suggests that there is a statistically significant 
difference between the two sets of estimates. 

In summary, these results suggest that it is highly desirable to make some adjust- 
ment for intracluster correlation, and how exactly we do so appears to have a relatively 
small impact on the results. 

Next we consider the results for the count variable, number of pharmacy visits
(PHARVIS) by individuals, using the Poisson model. This is an interesting variable
because a high proportion of medical expenditure in Vietnam takes the form of
self-prescribed medication through the purchase and use of over-the-counter drugs
purchased directly at pharmacies. This form of health care is assumed to be of lower
quality than that obtained under professional supervision. In Vietnam eligible
individuals, usually high-income government and private sector employees, are able to
purchase health insurance that entitles them to obtain care at government hospitals and
to obtain prescribed medications there also. From Table 24.3 observe that 16% of the
sampled individuals have such health insurance.


Table 24.5. Vietnam Health Care Use: Frequencies for Pharmacy Visits

Visits                  0      1      2      3      4      5      6      7      8      9     10+

PHARVIS (count)     20639   3827   1716    776    359    174     64     43     16      4    115
PHARVIS (fraction)   .744   .137   .062   .028   .013   .006   .002   .001   .000   .000   .004

Table 24.5 shows the observed frequency distribution of PHARVIS. About 26% of 
the individuals have one or more visits in the survey period and around 95% have a 
total of three or fewer visits. 

Table 24.6 presents the results for several variants of the Poisson regression,
analogous to those in Table 24.4 for linear regressions. The first column gives the Poisson
MLE estimates, and the ordinary unadjusted t-ratios are in the second column. The
next column shows robust t-ratios based on heteroskedasticity-consistent variance
estimates. These are considerably smaller, in some cases by a factor exceeding 2, than
the unadjusted ones. The fourth column gives cluster-adjusted t-ratios based on
variances calculated using (24.45). The fact that these are substantially smaller than those
in the two preceding columns confirms that there is indeed significant intracluster
correlation. The average cluster size exceeds 140 observations; hence even a low
degree of intracluster correlation is likely to inflate t-ratios substantially and the results
confirm that.


Table 24.6. Vietnam Health Care Use: RE and FE Models for Pharmacy Visits

                       Poisson                         Fixed Effects      Random Effects
                                  Het      Cluster     Poisson            Poisson
                                  Robust   Robust
Variables     Coef.      |t|      |t|      |t|         Coef.     |t|      Coef.     |t|

CONS         -1.637    35.78    18.81    12.25           -        -        1318    19.41
LNHHEXP        .078     5.68     3.08     1.90        -.114     6.01      -.095     4.95
INSURANCE     -.245     9.57     5.68     4.29        -.163     6.17      -.178     6.44
SEX            .084     4.96     2.76     2.73         .098     5.75       .099     5.71
AGE            .024     2.38     1.27     1.06         .003     0.32       .005     0.55
MARRIED        .124     5.92     2.96     2.78         .164     7.59       .158     7.38
ILLDAYS        .042    40.00    14.91    12.91         .046    40.14       .046    40.18
ACTDAYS        .008     1.71     0.43     0.45         .025     4.53       .024     4.35
INJURY         .171     2.30     0.84     0.85         .144     1.80       .143     1.80
ILLNESS        .562    87.15    24.60    21.81         .584    73.45       .585    74.16
EDUC          -.052    11.10     6.47     3.92        -.024     4.18      -.026     4.61

-ln L         25281                                   22446               23419
N             27765                                   27671               27765

We next consider modeling the intracluster correlation using FE and RE models. 
The FE model is estimated using the conditional MLE. Some clusters that do not have 
sufficient intracluster variation are dropped. The estimated coefficients lead to
dramatically different conclusions from those of the Poisson MLE estimates. First, note that
the coefficient of ln(HHEXP) switches from being significantly positive to being
significantly negative. This means that the original regression suggested that a pharmacy
visit is a normal good, but the FE estimates suggest that it is an inferior good; that is, 
individuals avoid this form of self-medication as income rises. This can be rationalized 
as the fixed effects picking up the influence of omitted variables that are correlated with 
the observed outcomes. These omitted variables could be the quantity and quality of 
alternative medical services available to commune residents. These could vary a great 
deal depending upon the geographical location and economic status of the communes. 

The last two columns in Table 24.6 give results based on a random effects
formulation. Here it is assumed that the intercept in the Poisson distribution varies randomly
over clusters, and each cluster “draws” its intercept from a common univariate distri- 
bution, specifically a gamma distribution with unit mean. This formulation is attractive 
because it does not require conditioning. The RE Poisson panel model with gamma- 
distributed intercept, developed by Hausman et al. (1984), has an analytical likelihood 
function that can be adapted for clustered data. The estimates obtained for the RE 
model are qualitatively similar to those from the FE model. However, the estimated 
coefficient for the key income variable has shifted a long way from that obtained un- 
der the simple Poisson assumption. 

This example shows that intracluster correlation may have an impact not only on
efficiency but also on the estimates themselves.


24.8. Complex Surveys 


The discussion in preceding sections focused on stratification, weighting, and clus- 
tering in isolation. Here we focus on complex surveys that use a stratified multistage 
cluster sampling design. The intent of such surveys is to present a population summary 
when population parameters may vary across strata. Then a weighted estimator is used 
and is viewed as an estimate of the census parameter. The goal is to consistently esti- 
mate the variance of the weighted estimator, controlling for clustering that can be more 
complicated than that in Section 24.5. 


24.8.1. Variance Estimation in Complex Surveys 


We consider the following setup. The $i$th observation in the sample is household $j$ in
cluster $c$ in stratum $s$. For example, the dependent variable is denoted $y_{scj}$, though more
formally the observation $(s, c, j)$ may be represented as observation $(s, c_s, j_{c_s})$. The
data are $(y_{scj}, \mathbf{x}_{scj}, w_{scj})$, where $w_{scj}$ are sample weights inversely proportional to the
probability of selection of the observation in the sample. The subscripts are ordered in
terms of level of disaggregation, a reversal from the notation of Section 24.5.

Two-stage or multistage sampling is used within strata, with households selected as 
the result of at least two sequential draws. First, a subset of all PSUs within the strata 
is randomly drawn. Second, a subset of all households in the selected PSUs is drawn, 
where clustered sampling may be permitted. Further draws within an SSU and so on 
are also possible. 


Variance of a Linear Statistic 
The starting point is to consider estimation of the variance of a linear statistic that sums
over strata, PSU, and households:

$$u = \sum_{s=1}^{S}\sum_{c=1}^{C_s}\sum_{j=1}^{N_{cs}} u_{scj} = \sum_{s=1}^{S}\sum_{c=1}^{C_s} u_{sc},$$

where $u_{sc}$ are the totals within a PSU, so $u_{sc} = \sum_{j=1}^{N_{cs}} u_{scj}$.
Examples of $u_{scj}$ such as the weighted mean and weighted regression are given in the
following. The variance of $u$ is

$$\mathrm{V}[u] = \sum_{s=1}^{S}\sum_{c=1}^{C_s}\mathrm{V}[u_{sc}] = \sum_{s=1}^{S} C_s\sigma_s^2,$$

if we assume that $u_{sc}$ are independent over strata and are iid over PSUs with common
variance $\sigma_s^2$. The usual unbiased variance estimate of $\sigma_s^2$ can be used, given $u_{sc}$ iid over
$c$, so $\hat{\sigma}_s^2 = (C_s - 1)^{-1}\sum_{c=1}^{C_s}(u_{sc} - \bar{u}_s)^2$. It follows that

$$\widehat{\mathrm{V}}[u] = \sum_{s=1}^{S}\frac{C_s}{C_s - 1}\sum_{c=1}^{C_s}(u_{sc} - \bar{u}_s)^2, \tag{24.55}$$

where $\bar{u}_s = C_s^{-1}\sum_{c=1}^{C_s} u_{sc}$ is the stratum average of the PSU totals.
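This estimator allows for clustering within a PSU, since

$$\sum_{c=1}^{C_s}(u_{sc} - \bar{u}_s)^2 = \sum_{c=1}^{C_s}\Bigg(\sum_{j=1}^{N_{cs}} u_{scj} - \bar{u}_s\Bigg)^2
= \sum_{c=1}^{C_s}\sum_{j=1}^{N_{cs}} \tilde{u}_{scj}^2 + \sum_{c=1}^{C_s}\sum_{j=1}^{N_{cs}}\sum_{k \neq j} \tilde{u}_{scj}\tilde{u}_{sck},$$

where $\tilde{u}_{scj} = u_{scj} - \bar{u}_s/N_{cs}$ is the household-level deviation, so that
$\sum_{j} \tilde{u}_{scj} = u_{sc} - \bar{u}_s$.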


The first sum is the contribution to the variance under SRS. The second sum will 
be positive under clustered sampling and leads to a larger variance. No assumption 
has been made about the nature of the sampling within strata nor about the type of 
clustering that arises. For example, (24.55) gives correct standard errors even if there
is three-stage sampling with further subsampling within SSUs.
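
The estimator (24.55) needs only the PSU totals within each stratum, which the following minimal sketch makes concrete. The function name, the stratum labels, and the numbers are illustrative assumptions, not values from the text.

```python
def linear_statistic_variance(psu_totals_by_stratum):
    """Variance estimate in the form of (24.55).

    psu_totals_by_stratum maps each stratum label to the list of PSU
    totals u_sc observed in that stratum.
    """
    variance = 0.0
    for totals in psu_totals_by_stratum.values():
        c_s = len(totals)                      # number of PSUs in the stratum
        if c_s < 2:
            raise ValueError("(24.55) needs at least two PSUs per stratum")
        u_bar_s = sum(totals) / c_s            # stratum average of PSU totals
        ss = sum((u_sc - u_bar_s) ** 2 for u_sc in totals)
        variance += c_s / (c_s - 1) * ss
    return variance


# Hypothetical PSU totals for two strata.
psu_totals = {"north": [10.0, 14.0], "south": [3.0, 5.0, 7.0]}
v_hat = linear_statistic_variance(psu_totals)
print(v_hat)  # 2/1 * 8 + 3/2 * 8 = 28.0
```

Because only the totals enter, any dependence among households inside a PSU is absorbed into the variation of the totals, which is why no model of the within-PSU sampling is required.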

The estimator (24.55) requires that at least two PSUs be drawn from each stratum,
that is, $C_s \geq 2$. If only one PSU is drawn then one possibility is to collapse the stratum
that includes the single PSU into another stratum that is viewed a priori as being
reasonably similar. This will lead to overestimation of $\mathrm{V}[u]$ as an upward bias is
introduced because of the different means in different strata.¹

In practice PSUs are sampled without replacement so there is some dependence in 
$u_{sc}$. Then (24.55) overestimates $\mathrm{V}[u]$, similar to the situation in Section 24.2.3. More 
complicated formulas have been proposed. 


Variance of the Weighted Mean 
The population mean is estimated by the ratio of the sample-weighted total of $y_{scj}$, say
$\hat{y}$, to the sum of the sample weights, say $\hat{w}$. Then

$$\hat{\mu}_W = \hat{y}/\hat{w} = \sum_{s=1}^{S}\sum_{c=1}^{C_s}\sum_{j=1}^{N_{cs}} w_{scj}y_{scj} \Bigg/ \sum_{s=1}^{S}\sum_{c=1}^{C_s}\sum_{j=1}^{N_{cs}} w_{scj}.$$

If the sample weights are treated as known, then more simply

$$\hat{\mu}_W = \sum_{s=1}^{S}\sum_{c=1}^{C_s}\sum_{j=1}^{N_{cs}} w^{*}_{scj}y_{scj},$$

where $w^{*}_{scj} = w_{scj}/\hat{w}$ and $\widehat{\mathrm{V}}[\hat{\mu}_W]$ can be obtained using (24.55) with $u_{scj} = w^{*}_{scj}y_{scj}$.


If the sample weights are treated as unknown then the delta method or linearization
method can be used to obtain $\mathrm{V}[\hat{y}/\hat{w}]$ as a function of $\mathrm{V}[\hat{y}]$, $\mathrm{V}[\hat{w}]$, and $\mathrm{Cov}[\hat{y}, \hat{w}]$.
The first two quantities can be estimated using (24.55) with $u_{scj} = w_{scj}y_{scj}$ and
$u_{scj} = w_{scj}$. The third quantity can be estimated with $(u_{sc} - \bar{u}_s)^2$ in (24.55) replaced
by $(u_{sc} - \bar{u}_s)(v_{sc} - \bar{v}_s)$, where $u_{scj} = w_{scj}y_{scj}$ and $v_{scj} = w_{scj}$. This is an example of
a ratio estimator.
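
The linearization just described can be sketched in a few lines. The example below assumes the standard first-order delta-method formula for a ratio, $\mathrm{V}[\hat{y}/\hat{w}] \approx \hat{w}^{-2}\{\mathrm{V}[\hat{y}] - 2\hat{\mu}\,\mathrm{Cov}[\hat{y},\hat{w}] + \hat{\mu}^2\mathrm{V}[\hat{w}]\}$, and estimates each piece in the form of (24.55). The data layout (stratum, then PSU, then `(weight, y)` pairs) and all numbers are hypothetical.

```python
import math

# Hypothetical survey: stratum -> list of PSUs -> list of (weight, y) pairs.
data = {
    "s1": [[(2.0, 10.0), (1.0, 12.0)], [(1.5, 9.0), (1.5, 11.0), (1.0, 8.0)]],
    "s2": [[(3.0, 20.0)], [(2.0, 18.0), (2.0, 22.0)], [(1.0, 25.0), (2.0, 19.0)]],
}


def psu_totals(f):
    """PSU totals of f(weight, y), arranged by stratum."""
    return {s: [sum(f(w, y) for w, y in psu) for psu in psus]
            for s, psus in data.items()}


def cov_24_55(u, v):
    """(24.55) with squared deviations replaced by cross-products."""
    total = 0.0
    for s in u:
        c_s = len(u[s])
        u_bar = sum(u[s]) / c_s
        v_bar = sum(v[s]) / c_s
        total += c_s / (c_s - 1) * sum(
            (a - u_bar) * (b - v_bar) for a, b in zip(u[s], v[s]))
    return total


uy = psu_totals(lambda w, y: w * y)   # u_scj = w_scj * y_scj
uw = psu_totals(lambda w, y: w)       # v_scj = w_scj
y_hat = sum(sum(t) for t in uy.values())
w_hat = sum(sum(t) for t in uw.values())
mu_hat = y_hat / w_hat

# Delta-method variance assembled from V[y_hat], V[w_hat], Cov[y_hat, w_hat].
var_linearized = (cov_24_55(uy, uy)
                  - 2.0 * mu_hat * cov_24_55(uy, uw)
                  + mu_hat ** 2 * cov_24_55(uw, uw)) / w_hat ** 2

# Same quantity from (24.55) applied to z_scj = w_scj (y_scj - mu_hat) / w_hat.
uz = psu_totals(lambda w, y: w * (y - mu_hat) / w_hat)
var_direct = cov_24_55(uz, uz)

print("weighted mean:", mu_hat)
print("linearized variance:", var_linearized, "direct:", var_direct)
```

The two variance computations agree up to rounding because the linearized variable is a linear combination of the two sets of PSU totals; the second form is often the more convenient one to code.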

For nonlinear statistics such as these ratio estimates, the literature proposes other 
estimates based on the jackknife and balanced repeated replication. Because of the 
nonlinearity the variance estimates are no longer unbiased but can be shown to be 
consistent if the number of strata $S \rightarrow \infty$ (see Krewski and Rao, 1981). Some results
with $S$ fixed and $\sum_{s}\sum_{c} N_{cs} \rightarrow \infty$ are summarized in Wolter (1985). One can also
bootstrap, though care is needed. See Rao and Wu (1988) and Shao and Tu (1995). 


Variance of Weighted Least-Squares Estimator 


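From Section 24.3, the weighted regression estimate $\hat{\boldsymbol{\beta}}_W$ of the census regression
parameters solves

$$\sum_{s=1}^{S}\sum_{c=1}^{C_s}\sum_{j=1}^{N_{cs}} w_{scj}\mathbf{x}_{scj}(y_{scj} - \mathbf{x}_{scj}'\hat{\boldsymbol{\beta}}_W) = \mathbf{0}.$$

By the usual algebra, we have

$$\hat{\boldsymbol{\beta}}_W - \boldsymbol{\beta} = \Bigg(\sum_{s=1}^{S}\sum_{c=1}^{C_s}\sum_{j=1}^{N_{cs}} w_{scj}\mathbf{x}_{scj}\mathbf{x}_{scj}'\Bigg)^{-1} \times \sum_{s=1}^{S}\sum_{c=1}^{C_s}\sum_{j=1}^{N_{cs}} w_{scj}\mathbf{x}_{scj}(y_{scj} - \mathbf{x}_{scj}'\boldsymbol{\beta}).$$

This leads to the sandwich form $\mathrm{V}[\hat{\boldsymbol{\beta}}_W] = \mathbf{A}^{-1}\mathbf{B}\mathbf{A}^{-1}$, where $\mathbf{B}$ is the variance of the
second triple sum, which can be estimated using (24.55) with
$\mathbf{u}_{scj} = w_{scj}\mathbf{x}_{scj}(y_{scj} - \mathbf{x}_{scj}'\hat{\boldsymbol{\beta}}_W)$.


1 For the CPS the method here cannot be directly applied as many strata have only one PSU and for other strata
only one PSU is collected. Instead, various pseudo-strata are formed and replication methods are used that
resample PSUs from the pseudo-strata. See U.S. Census Bureau (2002).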


Variance of Weighted m-Estimator 


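A quite general framework considers the weighted m-estimator $\hat{\boldsymbol{\theta}}_W$ that solves

$$\sum_{s=1}^{S}\sum_{c=1}^{C_s}\sum_{j=1}^{N_{cs}} w_{scj}\mathbf{h}(y_{scj}, \mathbf{x}_{scj}, \hat{\boldsymbol{\theta}}_W) = \mathbf{0}.$$

Examples include linear regression, $\mathbf{h}_{scj} = \mathbf{x}_{scj}(y_{scj} - \mathbf{x}_{scj}'\boldsymbol{\beta})$, and quasi-maximum
likelihood, $\mathbf{h}_{scj} = \partial \ln f(y_{scj}|\mathbf{x}_{scj}, \boldsymbol{\theta})/\partial\boldsymbol{\theta}$.

Assuming consistent estimation of $\boldsymbol{\theta}$, which requires that $\mathrm{E}[\mathbf{h}(y_{scj}, \mathbf{x}_{scj}, \boldsymbol{\theta})] = \mathbf{0}$,
we can use the usual first-order Taylor series expansion on the estimating equation
to get

$$\mathrm{V}[\hat{\boldsymbol{\theta}}_W] = \mathbf{A}^{-1}\mathbf{B}(\mathbf{A}')^{-1}, \qquad
\mathbf{A} = \mathrm{E}\Bigg[\sum_{s}\sum_{c}\sum_{j} w_{scj}\frac{\partial \mathbf{h}_{scj}}{\partial\boldsymbol{\theta}'}\Bigg], \qquad
\mathbf{B} = \sum_{s}\sum_{c}\mathrm{V}\Bigg[\sum_{j} w_{scj}\mathbf{h}_{scj}\Bigg],$$

where the expression for $\mathbf{B}$ assumed independence of $\mathbf{h}_{scj}$ over strata and clusters but
permits dependence within a cluster. Estimation of $\mathbf{A}$ is straightforward. For $\mathbf{B}$ use
(24.55) with $\mathbf{u}_{scj} = w_{scj}\mathbf{h}_{scj}$, so

$$\hat{\mathbf{B}} = \sum_{s=1}^{S}\frac{C_s}{C_s - 1}\sum_{c=1}^{C_s}(\mathbf{z}_{sc} - \bar{\mathbf{z}}_s)(\mathbf{z}_{sc} - \bar{\mathbf{z}}_s)',$$

where $\mathbf{z}_{sc} = \sum_{j=1}^{N_{cs}} w_{scj}\mathbf{h}(y_{scj}, \mathbf{x}_{scj}, \hat{\boldsymbol{\theta}}_W)$ and $\bar{\mathbf{z}}_s = C_s^{-1}\sum_{c=1}^{C_s}\mathbf{z}_{sc}$.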
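
For the linear regression case $\mathbf{h}_{scj} = \mathbf{x}_{scj}(y_{scj} - \mathbf{x}_{scj}'\boldsymbol{\beta})$ the sandwich can be computed by hand when there is a single regressor and no intercept, so that $\mathbf{A}$ and $\mathbf{B}$ are scalars. The sketch below makes that simplifying assumption; the data layout and values are hypothetical, and $\mathbf{A}$ is estimated by its sample analogue $\sum w_{scj}x_{scj}^2$.

```python
# Hypothetical survey: stratum -> list of PSUs -> list of (weight, x, y).
data = {
    "s1": [[(1.0, 1.0, 2.1), (2.0, 2.0, 3.9)], [(1.0, 3.0, 6.2)]],
    "s2": [[(1.5, 1.0, 1.8)], [(1.0, 2.0, 4.1), (1.0, 4.0, 8.3)], [(2.0, 1.0, 2.2)]],
}

obs = [o for psus in data.values() for psu in psus for o in psu]

# Weighted least squares through the origin: solves sum w x (y - x*theta) = 0.
A_hat = sum(w * x * x for w, x, y in obs)
theta_hat = sum(w * x * y for w, x, y in obs) / A_hat

# PSU totals z_sc of the weighted scores w x (y - x*theta_hat).
z = {s: [sum(w * x * (y - x * theta_hat) for w, x, y in psu) for psu in psus]
     for s, psus in data.items()}

# B_hat in the form of (24.55), applied to the PSU score totals.
B_hat = 0.0
for totals in z.values():
    c_s = len(totals)
    z_bar = sum(totals) / c_s
    B_hat += c_s / (c_s - 1) * sum((t - z_bar) ** 2 for t in totals)

# Scalar sandwich A^{-1} B A^{-1}.
var_theta = B_hat / A_hat ** 2

print("theta_hat:", theta_hat)
print("sandwich standard error:", var_theta ** 0.5)
```

With more than one regressor the same steps apply with matrix products and an explicit matrix inverse in place of the scalar division.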


Endogenous Stratification 


Sakata (1998) extends these results to endogenous sampling. He takes a census
parameter approach and provides asymptotic theory assuming the number of strata $S \rightarrow \infty$.
The results are the same as those given in the previous section.

24.9. Practical Considerations 


It is most common in microeconometrics research to take a structural approach. Un- 
weighted estimators are used, provided there is no endogenous stratification. The main 
concern is to obtain correct standard errors if clustering is present. If cluster effects are 
random there is usually little efficiency loss in ignoring clustering in estimation. Some 
packages may have a cluster robust standard errors option, not to be confused with a 
heteroskedasticity robust option, which is appropriate if cluster effects are random and 
there are many clusters. The CSRE and CSFE models can be implemented using OLS, 
provided in the case of CSFE there are not too many clusters. Alternatively, a panel 
data module can be used if it supports unbalanced panels. As with panel data most 
researchers outside econometrics are content to take a random effects approach, but a 
fixed effects approach may be necessary for consistent estimation. 

If a descriptive approach is taken and parameters vary over strata then weighting 
is necessary. A weighting option within least squares can be used, but it needs to be 
combined with a cluster-robust standard errors option. Some packages have a survey 
estimation module that obtains cluster-robust standard errors using the methods of 
Section 24.8. The package SUDAAN implements many of the methods in this chapter 
for linear and leading nonlinear regression models. 


24.10. Bibliographic Notes 


24.2-24.3 The literature on survey sampling is vast. Classic references on sample surveys in- 
clude Kish (1965) and Cochran (1977, first edition 1953). Skinner (1989) provides 
a useful overview and Groves (1989) provides a relatively nontechnical treatment 
that presents the approaches of many of the social sciences to surveying, while rais- 
ing many useful practical issues. For completeness we have incorporated some of 
this survey sampling literature, though econometrics studies rarely implement the 
methods in Section 24.8. There are few econometrics references, with the notable 
exception of chapters in Pudney (1989) and Deaton (1997) and a book chapter by 
Ullah and Breunig (1998). 

24.4 The main focus of the theoretical econometrics literature has been controlling for en- 
dogenous stratification. This literature is challenging and we have merely provided 
an overview. For detail see Amemiya (1985), who provides many references includ- 
ing Manski and Lerman (1977) for discrete-choice models and Hausman and Wise 
(1979) for sample selection models. The simple weighted estimator is generally ap- 
propriate albeit inefficient. Imbens and Lancaster (1996) present a practical way to 
implement a fully efficient estimator given specification of the conditional density. 

24.5 For microeconometrics applications controlling for clustering is of greatest impor- 
tance. The works by Kloek (1981) and Moulton (1986, 1990) were key in alerting 
econometricians to this problem. Davis (2002) gives a general treatment of multi- 
way error component models. Graubard and Korn (1994) provide a useful discussion 
of linear regression analysis of clustered data. They pay attention to both fixed and 
random effects models, with emphasis on the assumptions that must be satisfied for 
the random effects model to be valid. Pendergast et al. (1996) provide an extensive 
survey of the methods for analyzing clustered binary data. Because the middle term 
on the right-hand side of (24.34) involves averaging over the number of clusters, the 
precision of this estimate depends on the number of clusters. The consequences of
using the cluster-robust variance matrix when the number of clusters is small
continue to be a topic of research (Donald and Lang, 2001; Angrist and Lavy, 2002).
Wooldridge (2003) provides an overview. 


24.6 Hierarchical linear models have been extensively used in social sciences. Bryk and
Raudenbush (2002) provide a comprehensive coverage of binary, ordered, counted,
and multinomial outcomes from both likelihood and Bayesian perspectives.

24.7 Deaton (1997) examines a number of issues of modeling using data from clustered
samples from various Living Standards Surveys conducted in developing economies
by the World Bank.

24.8 Many standard statistical software packages (e.g., STATA and SUDAAN)
accommodate both fixed and random effects formulations of clustering in linear and nonlinear
models for cross-section and panel data.


Exercises


24-1 (a) Verify the expression for $\boldsymbol{\Omega}_c$ given at (24.25).

(b) Prove the consistency property of the estimators $s^2$ and $\hat{\rho}$ in the CSRE
model.

(c) Consider the bias of the standard errors in the balanced cluster CSRE
model. Show that in this case $\mathrm{E}[\sum_{c}\sum_{j}\hat{u}_{jc}^2] = \sigma^2[N - K(1 + \rho(m - 1))]$.


24-2 (Adapted from Greenwald, 1983) Consider the linear regression model
$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$, where $\mathrm{E}[\mathbf{u}] = \mathbf{0}$ and $\mathrm{E}[\mathbf{u}\mathbf{u}'] = \boldsymbol{\Omega}$. By standard results for
the OLS estimator $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ (see Section 4.4) we can obtain the correct
expression for $\mathrm{V}[\hat{\boldsymbol{\beta}}]$ as $\mathbf{V}_2 = (\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}'\boldsymbol{\Omega}\mathbf{X})(\mathbf{X}'\mathbf{X})^{-1}$, whereas $\mathbf{V}_1 = s^2(\mathbf{X}'\mathbf{X})^{-1}$
with $s^2 = \hat{\mathbf{u}}'\hat{\mathbf{u}}/(N - K)$ is invalid if $\boldsymbol{\Omega} \neq \sigma^2\mathbf{I}$.

(a) Show that the bias of $\mathbf{V}_1$ is given by $\mathbf{B} = \mathbf{B}_1 + \mathbf{B}_2$, where
$\mathbf{B}_2 = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\boldsymbol{\Omega} - \sigma^2\mathbf{I})\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}$ and $\mathbf{B}_1 = (N - K)^{-1}\,\mathrm{tr}\{\mathbf{B}_2(\mathbf{X}'\mathbf{X})\}(\mathbf{X}'\mathbf{X})^{-1}$.
(Greenwald refers to $\mathbf{B}_2$ as "direct bias.")

(b) Evaluate the two terms for the special case of $\mathbf{X}'\mathbf{X} = \mathbf{I}_K$. Show that $\mathbf{B} \rightarrow \mathbf{B}_2$
as $N \rightarrow \infty$.


24-3 Consider the OLS cluster-robust variance estimator formula (24.33). Suppose
there are two levels of clustering. Specifically, in the context of the empirical
example of this chapter, clustering could be at the level of family and commune
if multiple members of the family from the same commune are included in the
survey. How will the formula be modified if the data have two levels of clustering?


24-4 For this exercise use a 50% sample of the VLSMS data. Define $y = 1$ if the
subject has at least one pharmacy visit (PHARVIS) and $y = 0$ otherwise. This
example presumes access to a program that handles clustering.

(a) Using the same explanatory variables as those for the Poisson model in
Section 24.7, estimate a binary logit model by maximum likelihood, using
both the standard estimator and the robust sandwich estimator for the
variance.

(b) Reestimate the specification of part (a) using the cluster-robust standard
error option. Explain the differences between the robust standard errors of
parts (a) and (b).

(c) Use the "commune" variable as a cluster identifier. Reestimate the logit
model using the cluster fixed effects and cluster random effects
specification. Compare the estimates and standard errors of the coefficients of
LNHHEXP and INSURANCE. Are the conclusions about the significance of
these variables affected by clustering in the data?

CHAPTER 25 


Treatment Evaluation 


25.1. Introduction 


The topic of treatment evaluation concerns measuring the impact of interventions on 
outcomes of interest, with the type of intervention and outcome being defined broadly 
so as to apply to many different contexts. The treatment evaluation approach and some 
of its terminology come from the medical sciences, where intervention frequently means 
adopting a treatment regime. Subsequently, one may be interested in measuring the 
response to the treatment relative to some benchmark, such as no treatment or a differ- 
ent treatment. In economic applications treatment and interventions usually mean the 
same thing. 

Examples of treatments in the economic context are enrollment into a labor train- 
ing program, being a member of a trade union, receipt of a transfer payment from 
a social program, changes in regulations for receiving a transfer from a social pro- 
gram, changes in rules and regulations pertaining to financial transactions, changes 
in economic incentives, and so forth; see Moffitt (1992), Friedlander, Greenberg, and 
Robbins (1997), and Heckman, Lalonde, and Smith (1999). If the treatment that is 
applied can vary in intensity or type, we use the term multiple treatments when re- 
ferring to them collectively. Relative to a single type of treatment this does not create 
complications, but now the choice of a benchmark for comparisons is more flexible. 

The term outcome refers to the effect of changes in economic status or environment on 
the economic outcomes of individuals. A leading case is one in which the outcome of interest 
is a continuous variable, say $y$, whereas the treatment variable is discrete and of the on/off 
variety, say $D$, where $D$ takes the value 1 if the treatment is applied and is 0 otherwise. 
An example of an intervention is labor market training, which could affect posttraining 
wages of the worker. In general, however, either the outcome or treatment can be con- 
tinuous or discrete or exhibit limited variation. Whereas the details of the analysis will 
vary, certain key ideas will be relevant in all situations. For simplicity, we will take the 
case of a continuous outcome and a binary-valued treatment as our leading case. Later 
we will extend the analysis to other practically relevant situations. 

Policy relevance of treatment evaluation is direct because “successful” treatments 
can be linked to desirable social programs, or improvements in existing programs to 
attain objectives of social policy. Heckman and Smith (1998) have discussed the rela- 
tionship between several commonly used measures of treatment impact and traditional 
cost-benefit analysis. 

The standard problem in treatment evaluation involves the inference of a causal 
connection between the treatment and the outcome. In a canonical single-treatment 
example we observe $(y_i, \mathbf{x}_i, D_i)$, $i = 1, \ldots, N$, and the impact of a hypothetical change 
in $D$ on $y$, holding $\mathbf{x}$ constant, is of interest. Such inference is the main feature of 
the potential outcome model, already introduced in Chapter 2, in which the outcome 
variable of interest is compared in the treated and nontreated states. However, no in- 
dividual is simultaneously observed in both states. Hence, the situation is akin to one 
of missing data, and it can be tackled by methods of causal inference carried out in 
terms of counterfactuals. We ask how the outcome of an average untreated individual 
would change if such a person were to receive the treatment. That is, a magnitude like 
$\Delta y / \Delta D$ is of interest. Fundamentally one’s interest lies in the outcomes that result 
from, or are caused by, such interventions. Here causation is in the sense of ceteris 
paribus, meaning that we hold all other variables constant. 

What is the difference between this chapter and earlier ones in which we also con- 
sidered the identification and estimation of a variety of models? There are many sim- 
ilarities and the differences arise from a shift of emphasis. The main difference stems 
from the focus on a family of measures of treatment effectiveness. These measures are 
functions of parameters and data, and they enable comparisons with policy-relevant 
counterfactuals. An important and interesting result is that not all measures can be 
constructed, given the data and the estimator. The choice of an estimator and the type 
of data used in model estimation place restrictions on the counterfactuals that can be 
identified, and hence on the impact measures that can be consistently estimated. 

Another emphasis in the literature on treatment evaluation is on the advantages of 
identification secured using minimal functional form and exclusion restrictions (e.g., 
semiparametric identification). This emphasis is motivated by the desire to produce 
results that have policy significance but whose validity does not depend on strong 
assumptions. The feasibility of semiparametric identification is easier to 
establish for treatment effect estimation in linear models, with continuous support 
for the dependent variable, than it is in nonlinear models with limited dependent 
variables. 

Section 25.2 discusses identification assumptions. Section 25.3 presents mea- 
sures of treatment effect that are usually targeted in identification and estimation. 
Section 25.4 analyzes matching and propensity score estimators. Differences-in- 
differences estimators of treatment effects that are common in event studies with a 
quasi-experimental data setup are covered in Section 25.5. Continuing with a quasi- 
experimental setup, we discuss the regression discontinuity design in Section 25.6, fol- 
lowed by the instrumental variable estimator in Section 25.7. Much of the discussion 
up to this point is related to linear models. Section 25.8 provides a detailed empirical 
illustration of the methods developed in the chapter. 


25.2. Setup and Assumptions 


The methods for estimation of treatment effects rely on assumptions to permit 
identification of causal effects just as, for example, the linear SEM relies on assumptions 
to permit a causal interpretation (see Chapter 2). In this section we detail the assumptions that 
permit use of the key matching and propensity score estimators that are presented later 
in Section 25.4. 

First we consider a framework for estimating causal parameters in treatment 
evaluation. 


25.2.1. Treatment Effects Framework 


Let us begin with the setup of randomized treatment assignment in a social experiment 
as described in Section 3.3. Let there be a target population for the treatment of interest 
and let N denote the number of randomly selected individuals who are eligible for 
treatment. Let $N_T$ denote the number of randomly selected individuals who are treated 
and let $N_C = N - N_T$ denote the number of nontreated individuals who serve as a 
potential control group. 

Random assignment implies that the treatment assignment ignores the possible 
impact of the treatment on the outcomes. For example, no one is included in the 
treatment group on the grounds that the benefit of the treatment to that individual 
would be large, and no one is excluded because the expected benefit is small. Let 
$(y_i, \mathbf{x}_i, D_i;\ i = 1, \ldots, N)$ be the vector of observations on the scalar-valued outcome 
variable $y$, a vector of observable variables $\mathbf{x}$, and a binary indicator of a treatment 
variable $D$. For simplicity, we assume that anyone who is assigned treatment gets it, 
and anyone who is not does not get it. The outcome variable of the treated individual is 
denoted $y_1$ and that for the nontreated individual is denoted $y_0$. After the experiment is 
run and data are collected, we would like to obtain a measure of the treatment impact. 
The most natural way of measuring the effect of the treatment would be to construct a 
measure that compares the average outcomes of the treated and nontreated groups. 
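
As a minimal sketch of this comparison, assuming simulated data rather than any data set used in this book, the following Python code assigns treatment by a coin flip and computes the difference between the average outcomes of the treated and nontreated groups. The function names, the true effect of 2.0, and the data-generating process are all illustrative assumptions.

```python
import random
from statistics import mean


def difference_in_means(y, d):
    """Mean outcome of the treated (d == 1) minus mean outcome of the nontreated (d == 0)."""
    treated = [yi for yi, di in zip(y, d) if di == 1]
    control = [yi for yi, di in zip(y, d) if di == 0]
    if not treated or not control:
        raise ValueError("need at least one treated and one nontreated observation")
    return mean(treated) - mean(control)


def simulate_experiment(n=20000, effect=2.0, seed=12345):
    """Hypothetical experiment: D is a fair coin flip, independent of y0 and y1."""
    rng = random.Random(seed)
    y, d = [], []
    for _ in range(n):
        y0 = rng.gauss(10.0, 1.0)          # potential outcome without treatment
        y1 = y0 + effect                   # potential outcome with treatment
        di = 1 if rng.random() < 0.5 else 0
        y.append(y1 if di == 1 else y0)    # only one potential outcome is observed
        d.append(di)
    return y, d


y, d = simulate_experiment()
estimate = difference_in_means(y, d)       # should be close to the assumed effect of 2.0
```

Because assignment here ignores the potential outcomes by construction, the simple difference in means recovers the assumed effect up to sampling error.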

With one important difference the same data setup could be applied to observational 
data. The difference is that there is no random assignment mechanism for treatment, 
perhaps because individuals choose to be treated, or because of some other reason. 

It needs to be stated at the outset that most treatment evaluation studies have a par- 
tial equilibrium character. Specifically, they assume an absence of general equilibrium 
effects. By that we mean that the treatment effects are small and do not affect the sta- 
tus of some of the variables that are treated as exogenous. This assumption will not 
do if one were considering a treatment program that affected an entire sector that was 
a significant part of the national economy. For example, instituting universal health 
insurance may have impact on the entire health services sector, which would make it 
difficult to apply the methods discussed in this chapter. 

There are potential pitfalls in constructing estimates of treatment effects. There are 
also subtle differences of interpretations that arise from variations in the assumptions 
used to construct such measures. Therefore, we begin by examining these assumptions. 




25.2.2. Conditional Independence Assumption 


Meaningful comparisons between the outcomes of the two groups require some as- 
sumptions. We shall initially list and explain these assumptions and later use them in 
the discussion of identifiability of certain treatment effects. 

An important assumption is the conditional independence assumption that states 
that conditional on x, the outcomes are independent of treatment, written as 


$$ y_0, y_1 \perp D \mid \mathbf{x}. \tag{25.1} $$


The behavioral implication of this assumption is that participation in the treatment program 
does not depend on outcomes, after controlling for the variation in outcomes induced 
by differences in x. Random assignment, properly applied, will validate this assump- 
tion. Indeed, under completely random assignment one may even make a stronger 
assumption 


$$ y_0, y_1 \perp D, \tag{25.2} $$


because randomization would be over (y, x) space. The more commonly used assump- 
tion (25.1), if valid, can be useful for identification of some impact parameters because 
it states that once we control for the effects of regressors x, some of which may be re- 
lated to D, treatment and outcomes are independent. 

The conditional independence assumption is broad and implies the following: 


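\begin{align}
F(y_j \mid \mathbf{x}, D = 1) &= F(y_j \mid \mathbf{x}, D = 0) = F(y_j \mid \mathbf{x}), \quad j = 0, 1, \tag{25.3} \\
F(u_j \mid \mathbf{x}, D = 1) &= F(u_j \mid \mathbf{x}, D = 0) = F(u_j \mid \mathbf{x}), \quad j = 0, 1, \notag
\end{align}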


where $u_j$ is the regression model error for $y_j$. In other words, the participation decision does 
not affect the distribution of potential outcomes. 

To see the impact of this assumption let $E[y \mid \mathbf{x}, D]$ be linear; that is, the 
outcome-participation equation is 


$$ y = \mathbf{x}'\boldsymbol{\beta} + \alpha D + u, \tag{25.4} $$


where $E[u \mid D] = E[y - \mathbf{x}'\boldsymbol{\beta} - \alpha D \mid D] = 0$. Therefore, $D$ may be treated as an 
exogenous variable, and there will be no simultaneity bias or selection bias. Under the 
standard conditions on $\mathbf{x}$, consistent estimation of regression parameters is possible. 

An assumption that is weaker than (25.1) is 


$$ y_0 \perp D \mid \mathbf{x}, \tag{25.5} $$


which implies conditional independence of participation and $y_0$. This assumption is 
used in establishing identifiability of a population-average treatment effect on the 
treated (ATET), as will be seen later. 

Assumption (25.5) has other names in the literature. Imbens (2005) refers to it as the 
unconfoundedness assumption and Rubin refers to it as the ignorability assumption 
(Rubin, 1978; Wooldridge, 2001). If valid, the assumption implies that there is no 
omitted variable bias once x is included in the regression, and hence there will be 
no confounding. The assumption is tantamount to treatment assignment that ignores 
outcomes; hence it is appropriate to refer to it as the ignorability assumption. 

This assumption is necessary if the treatment variable is to be treated as exogenous, 
which is essential for simplicity in estimation. If valid, sample selection models or IV 
methods to handle endogenous treatment variables are not needed, and the methods of 
Section 25.4 can be applied. 


25.2.3. Matching Assumption 


A second assumption, referred to as the overlap or matching assumption, is neces- 
sary for identifying some population measures of impact. It states that 


0 < Pr[D = 1|x] < 1. (25.6) 


This assumption ensures that for each value of x there are both treated and nontreated 
cases. In that sense there is overlap between the treated and untreated subsamples. For 
each treated individual there is another matched untreated individual with a similar 
x. If the assumption were to fail, then we could potentially have individuals with x 
vectors who are all treated and those with a different x who are all untreated. The 
full two-sided condition is not required for identifying the treatment parameter for the 
treated group. For that parameter one needs for each participant an analogous 
nonparticipant, so the one-sided condition Pr[D = 1|x] < 1 is sufficient. Identifying the 
treatment effect on a randomly selected individual requires condition (25.6) in full. 
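
With a discrete covariate, the sample analogue of the overlap condition can be inspected directly by counting treated and nontreated cases at each value of x. The sketch below is illustrative only; the function names and the small data set are hypothetical.

```python
from collections import defaultdict


def overlap_report(x, d):
    """For each distinct covariate value, count treated and nontreated cases.

    Returns {x_value: (n_treated, n_nontreated)}.
    """
    counts = defaultdict(lambda: [0, 0])
    for xi, di in zip(x, d):
        counts[xi][0 if di == 1 else 1] += 1
    return {value: tuple(c) for value, c in counts.items()}


def cells_violating_overlap(x, d):
    """Covariate values whose sample frequency of D = 1 is exactly 0 or 1."""
    report = overlap_report(x, d)
    return sorted(v for v, (n1, n0) in report.items() if n1 == 0 or n0 == 0)


# Hypothetical data: x is an age group and d is the treatment indicator.
x = ["young", "young", "middle", "middle", "old", "old"]
d = [1, 0, 1, 0, 1, 1]
bad_cells = cells_violating_overlap(x, d)   # the "old" cell has no nontreated cases
```

A cell flagged in this way has no sample counterpart in the other group, so no match on x is available there. With continuous covariates the same idea is usually applied to the estimated propensity score rather than to x itself.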


25.2.4. Conditional Mean Assumption 


A third assumption is the conditional mean independence assumption 


$$ E[y_0 \mid D = 1, \mathbf{x}] = E[y_0 \mid D = 0, \mathbf{x}] = E[y_0 \mid \mathbf{x}], \tag{25.7} $$


which implies that $y_0$ does not determine participation. 


25.2.5. Propensity Scores 


When treatment participation is not by random assignment but depends stochastically 
on a vector of observable variables x, as in observational data or when the treatment is 
targeted to some population defined by some observable characteristics (such as age, 
sex, or socioeconomic status), then the concept of propensity scores is useful. This 
is a conditional probability measure of treatment participation given x and is denoted 
p(x), where 


p(x) = Pr[D = 1|X = x]. (25.8) 


The propensity score measure can be computed given the data $(D_i, \mathbf{x}_i)$ using any of 
the parametric or semiparametric methods covered in Chapter 14 (e.g., by doing a logit 
regression). 
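
To illustrate the logit route, here is a minimal sketch that fits a logit of D on an intercept and a single scalar regressor by Newton-Raphson and returns the fitted propensity scores. It assumes one regressor and no perfect separation in the data; the function names and the toy data are hypothetical, and in practice one would use a standard statistical package.

```python
import math


def fit_logit(x, d, iterations=25):
    """Maximum likelihood logit of d on an intercept and a scalar regressor x.

    Uses Newton-Raphson on the log-likelihood; returns (intercept, slope).
    """
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, di in zip(x, d):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = p * (1.0 - p)
            g0 += di - p               # score with respect to the intercept
            g1 += (di - p) * xi        # score with respect to the slope
            h00 += w                   # elements of the negative Hessian
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (negative Hessian) * step = score.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def propensity_score(xi, a, b):
    """Fitted Pr[D = 1 | x = xi] from the logit coefficients."""
    return 1.0 / (1.0 + math.exp(-(a + b * xi)))


# Hypothetical data with a binary regressor: 1 of 4 treated at x = 0, 3 of 4 at x = 1.
x = [0, 0, 0, 0, 1, 1, 1, 1]
d = [1, 0, 0, 0, 1, 1, 1, 0]
a_hat, b_hat = fit_logit(x, d)
p0, p1 = propensity_score(0, a_hat, b_hat), propensity_score(1, a_hat, b_hat)
```

With a single binary regressor the logit is saturated, so the fitted scores coincide with the treated shares in each cell (0.25 and 0.75 here), which gives a simple check on the routine.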

An assumption that plays an important role in treatment evaluation is the balancing 
condition, which states that 


$$ D \perp \mathbf{x} \mid p(\mathbf{x}). \tag{25.9} $$




Table 25.1. Treatment Effects Framework 

Symbol              Definition 
$y_1$               Outcome for the treated group 
$y_0$               Outcome for the nontreated group 
$p(\mathbf{x})$     Propensity score 
$N_T$               Number of treated cases in the sample 


This can be expressed alternatively by saying that for individuals with the same 
propensity score the assignment to treatment is random, so that treated and untreated 
individuals should look identical in terms of their x vector. The balancing condition is a 
testable hypothesis. 
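
One informal way to examine the balancing condition is to group observations into strata of the estimated propensity score and compare the mean of a covariate between treated and nontreated cases within each stratum. The sketch below assumes equal-width strata on [0, 1] and a single covariate; the function name, the number of strata, and the data are hypothetical, and a formal test would also need standard errors for these differences.

```python
from collections import defaultdict
from statistics import mean


def balance_by_stratum(x, d, pscore, n_strata=5):
    """Within propensity-score strata, treated minus nontreated mean of x.

    Returns {stratum index: difference}; strata lacking either group are omitted.
    """
    groups = defaultdict(lambda: {1: [], 0: []})
    for xi, di, pi in zip(x, d, pscore):
        stratum = min(int(pi * n_strata), n_strata - 1)   # equal-width bins on [0, 1]
        groups[stratum][di].append(xi)
    return {s: mean(g[1]) - mean(g[0])
            for s, g in sorted(groups.items()) if g[1] and g[0]}


# Hypothetical covariate values, treatment indicators, and estimated scores.
x = [1.0, 1.2, 3.0, 3.0, 5.0, 4.0]
d = [1, 0, 1, 0, 1, 0]
pscore = [0.10, 0.15, 0.50, 0.55, 0.90, 0.95]
differences = balance_by_stratum(x, d, pscore)
```

Differences near zero in every stratum are consistent with balance; large differences suggest that the propensity score specification or the stratification should be revisited.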

A useful result about conditional independence given p(x) due to Rosenbaum and 
Rubin (1983) states that 


$$ y_0, y_1 \perp D \mid \mathbf{x} \;\Longrightarrow\; y_0, y_1 \perp D \mid p(\mathbf{x}). \tag{25.10} $$


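That is, the conditional independence assumption given $\mathbf{x}$ implies conditional 
independence given $p(\mathbf{x})$, that is, independence of $y_0$, $y_1$, and $D$ given $p(\mathbf{x})$. 
To obtain this result, note that 


\begin{align}
\Pr[D = 1 \mid y_0, y_1, p(\mathbf{x})] &= E[D \mid y_0, y_1, p(\mathbf{x})] \notag \\
&= E[E[D \mid y_0, y_1, p(\mathbf{x}), \mathbf{x}] \mid y_0, y_1, p(\mathbf{x})] \notag \\
&= E[E[D \mid y_0, y_1, \mathbf{x}] \mid y_0, y_1, p(\mathbf{x})] \notag \\
&= E[E[D \mid \mathbf{x}] \mid y_0, y_1, p(\mathbf{x})] \notag \\
&= E[p(\mathbf{x}) \mid y_0, y_1, p(\mathbf{x})] \notag \\
&= p(\mathbf{x}). \notag
\end{align}


Here the second line follows from the law of iterated expectations, and the third holds 
because $p(\mathbf{x})$ is a function of $\mathbf{x}$, so conditioning on it in addition to $\mathbf{x}$ is redundant. 
The fourth line uses conditional independence. The intuition behind this result is that $p(\mathbf{x})$ is a 
particular function of $\mathbf{x}$ and, in a sense, contains less information than $\mathbf{x}$. Hence 
conditional independence given $\mathbf{x}$ carries over to conditional independence given $p(\mathbf{x})$. 
Just as conditioning on $\mathbf{x}$ removes the correlation between $\mathbf{x}$ and $D$, conditioning 
on the propensity score $p(\mathbf{x})$ also expunges the correlation between $\mathbf{x}$ and $D$. Thus 
a regression similar to (25.4) is 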


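\begin{align}
y &= \mathbf{x}'\boldsymbol{\beta} + \alpha p(\mathbf{x}) + u \tag{25.11} \\
&= \mathbf{x}'\boldsymbol{\beta} + \alpha \widehat{p}(\mathbf{x}) + \{u + \alpha(p(\mathbf{x}) - \widehat{p}(\mathbf{x}))\}, \tag{25.12}
\end{align}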


where in the second line the unknown p(x) is replaced by a sample estimate, resulting 
in the addition of the sampling error to the regression error. The pros and cons of this 
strategy will be considered later. Table 25.1 summarizes the notation. 


25.3. Treatment Effects and Selection Bias 


We begin by presenting two widely used measures of treatment effect: one that 
averages over all individuals and one that averages over only the treated. We then discuss 
in some detail the role of selection into treatment. The methods presented in Sections 
25.4–25.6 presume that selection effects directly depend on only measurable observed 
characteristics of the individual, such as age. If additionally selection effects depend 
on unobservables then the methods of Chapter 16 must instead be used. The current 
section includes considerable discussion of selection issues. 


25.3.1. Two Key Parameters: ATE and ATET 


Define $\Delta$ as the difference between the outcome in the treated and untreated states 


$$ \Delta = y_1 - y_0, \tag{25.13} $$


where we may condition on $\mathbf{x}$ if desired. It is emphasized that $\Delta$ is not directly 
observable because no individual can be observed in both states. Population values of the 
average treatment effect and average treatment effect on the treated are defined as 


\begin{align}
\mathrm{ATE} &= E[\Delta], \tag{25.14} \\
\mathrm{ATET} &= E[\Delta \mid D = 1], \tag{25.15}
\end{align}


with sample analogues 


\begin{align}
\widehat{\mathrm{ATE}} &= \frac{1}{N} \sum_{i=1}^{N} [\Delta_i], \tag{25.16} \\
\widehat{\mathrm{ATET}} &= \frac{1}{N_T} \sum_{i=1}^{N_T} [\Delta_i \mid D_i = 1], \tag{25.17}
\end{align}


where $N_T = \sum_{i=1}^{N} D_i$. In each of these two cases, computation is straightforward if 
$\Delta_i$ can be obtained. The procedure is not direct because the formulas have an 
unobserved component that must be estimated and that step calls for some assumptions. 

The ATE measure is relevant when the treatment has universal applicability so that 
it is reasonable to consider the hypothetical gain from treatment to a randomly selected 
member of the population. The ATET measure is relevant when we want to consider 
the average gain from treatment for the treated. See Heckman and Vytlacil (2002). 
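
The distinction between the two measures is easy to see in an artificial setting where both potential outcomes are known for every individual, which is never the case with real data. The sketch below computes the sample analogues of ATE and ATET under that assumption; the function names and the five hypothetical individuals are illustrative only.

```python
from statistics import mean


def ate(y1, y0):
    """Sample analogue of the ATE: average of Delta_i = y1_i - y0_i over everyone."""
    return mean([a - b for a, b in zip(y1, y0)])


def atet(y1, y0, d):
    """Sample analogue of the ATET: average of Delta_i over the treated only."""
    gains = [a - b for a, b, di in zip(y1, y0, d) if di == 1]
    if not gains:
        raise ValueError("no treated observations")
    return mean(gains)


# Hypothetical potential outcomes; individuals with larger gains select into treatment.
y0 = [10.0, 12.0, 9.0, 11.0, 8.0]
y1 = [13.0, 12.5, 12.0, 11.0, 12.0]
d = [1, 0, 1, 0, 1]
ate_hat = ate(y1, y0)         # averages the gains 3, 0.5, 3, 0, 4
atet_hat = atet(y1, y0, d)    # averages the gains 3, 3, 4 of the treated
```

In this example the ATET exceeds the ATE because those who gain most are the ones treated. With real data only one of the two potential outcomes is observed for each individual, so neither average can be formed without further assumptions.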

To understand the treatment evaluation problem consider the average gain from 
participation given characteristics x. This is 


\begin{align}
\mathrm{ATE} &= E[\Delta \mid \mathbf{X} = \mathbf{x}] \tag{25.18} \\
&= E[y_1 - y_0 \mid \mathbf{X} = \mathbf{x}] \notag \\
&= E[y_1 \mid \mathbf{X} = \mathbf{x}] - E[y_0 \mid \mathbf{X} = \mathbf{x}] \notag \\
&= E[y_1 \mid \mathbf{x}, D = 1] - E[y_0 \mid \mathbf{x}, D = 0], \notag
\end{align}


where the last equality uses the conditional independence assumption (25.1). 
Given a sample of participants, $E[y_1 \mid D = 1, \mathbf{x}]$ can be estimated. However, 
the second term cannot be estimated from participants alone: under (25.1) it equals 
$E[y_0 \mid \mathbf{x}, D = 1]$, a measure of the average outcomes 
for the participants had they in fact not participated, and one cannot simultaneously 
observe the same individuals as both participants and nonparticipants. To make ATE 
operational we must find an estimator for the second term. 

By definition (25.18) 


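\begin{align}
\mathrm{ATE} &= E[y_1 \mid \mathbf{x}, D = 1] - E[y_0 \mid \mathbf{x}, D = 0] \tag{25.19} \\
&= \mu_1(\mathbf{x}) - \mu_0(\mathbf{x}) + E[u_1 \mid \mathbf{x}, D = 1] - E[u_0 \mid \mathbf{x}, D = 0] \notag \\
&= \mu_1(\mathbf{x}) - \mu_0(\mathbf{x}) + E[u_1 \mid \mathbf{x}] - E[u_0 \mid \mathbf{x}] \notag \\
&= \mu_1(\mathbf{x}) - \mu_0(\mathbf{x}), \tag{25.20}
\end{align}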


where the first term in the first line on the right-hand side can be estimated using 
the data from treatment participants, but the second term is not directly observable. 
The next three lines follow by applying the conditional independence and conditional 
mean assumptions and adopting the specifications $y_1 = \mu_1(\mathbf{x}) + u_1$ for the treated and 
$y_0 = \mu_0(\mathbf{x}) + u_0$ for the untreated. The second from the last line only requires mean 
independence rather than full conditional independence. 


25.3.2. Sampling and Selection Bias 


The crux of the evaluation problem is that $E[y_0 \mid \mathbf{x}, D = 1]$ is unobservable. The 
solution to the problem depends in part on the type of data available. Social experiments 
use the eligible participants that are excluded from the treatment group as a proxy 
for the counterfactual. Observational studies generate a comparison group from the 
same source as the treated group, or from other databases, and essentially end up 
using some function of $E[y_0 \mid \mathbf{x}, D = 0]$ that can be estimated using data from 
nonparticipants. The simplicity of the computation when the data come from a well-designed 
and executed social experiment should be viewed against the background of actual 
social experiments, which are subject to other problems such as randomization bias 
and substitution bias (discussed in Chapter 3). 

Suppose that for the treated participants the outcome equation is 


\begin{align}
y_1 &= E[y_1 \mid \mathbf{x}] + u_1 \tag{25.21} \\
&= \mu_1(\mathbf{x}) + u_1 \tag{25.22}
\end{align}


and for the nonparticipants the equation is 


\begin{align}
y_0 &= E[y_0 \mid \mathbf{x}] + u_0 \tag{25.23} \\
&= \mu_0(\mathbf{x}) + u_0. \tag{25.24}
\end{align}


Note that this specification is of the switching regression type (analogous to the Roy 
model discussed in Section 16.7) in the sense that the treated and nontreated have 
different conditional mean functions, $\mu_1(\mathbf{x})$ and $\mu_0(\mathbf{x})$, that are written in a more 
general notation than necessary for the purely linear model. We assume that $E[u_1 \mid \mathbf{x}] = 
E[u_0 \mid \mathbf{x}] = 0$, though $E[u_1 \mid \mathbf{x}, D]$ and $E[u_0 \mid \mathbf{x}, D]$ do not necessarily equal zero. 

A more common, but restrictive, specification has 


$$ \mu_1(\mathbf{x}) = \mu_0(\mathbf{x}) + \alpha D, \tag{25.25} $$


in which the treated group has an additional intercept component $\alpha$, but the slope 
coefficients of the regressors are unaffected by the treatment. 

Table 25.2. Treatment Effects Measures: ATE and ATET 

Measure                  Treatment Effect                                                      Special Case (25.25) 
ATE given x              $E[\Delta \mid \mathbf{x}] = \mu_1(\mathbf{x}) - \mu_0(\mathbf{x})$   $E[\Delta \mid \mathbf{x}] = \alpha$ 

ATET with x              $E[\Delta \mid \mathbf{x}, D = 1]$                                    $E[\Delta \mid \mathbf{x}, D = 1]$ 
and selection effect     $= \mu_1(\mathbf{x}) - \mu_0(\mathbf{x})$                             $= \alpha + E[u_1 - u_0 \mid \mathbf{x}, D = 1]$ 
                         $\quad + E[u_1 - u_0 \mid \mathbf{x}, D = 1]$ 

Additional benefit       $E[u_1 - u_0 \mid \mathbf{x}, D = 1]$                                 $E[u_1 - u_0 \mid \mathbf{x}, D = 1]$ 
to individual with x 

Average selection bias   $E[u_0 \mid \mathbf{x}, D = 1]$                                       $E[u_0 \mid \mathbf{x}, D = 1]$ 
                         $\quad - E[u_0 \mid \mathbf{x}, D = 0]$                               $\quad - E[u_0 \mid \mathbf{x}, D = 0]$ 


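The observed outcome $y$ is written as 


$$ y = D y_1 + (1 - D) y_0. \tag{25.26} $$


Combining these equations we get 


\begin{align}
y &= D(\mu_1(\mathbf{x}) + u_1) + (1 - D)(\mu_0(\mathbf{x}) + u_0) \notag \\
&= \mu_0(\mathbf{x}) + D(\mu_1(\mathbf{x}) - \mu_0(\mathbf{x}) + u_1 - u_0) + u_0. \tag{25.27}
\end{align}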


Because $D = 1$ or 0, the second term in the regression “switches” on and off. The 
second term in (25.27) measures the benefit of participation; the first component 
$\mu_1(\mathbf{x}) - \mu_0(\mathbf{x})$ measures the average gain to a participant with characteristics $\mathbf{x}$ and 
the second component $(u_1 - u_0)$ is the individual-specific benefit. The second component 
may be observable by the participant, but not by the investigator. 

The expressions for ATE and ATET are given in Table 25.2, for the general case 
and the specialization (25.25). 

Average selection bias is the difference between program participants and 
nonparticipants in the base state. This effect cannot be attributed to the program. A special 
case is $E[u_1 - u_0 \mid \mathbf{x}, D = 1] = 0$, which can arise if there are no unobservable 
components of the benefit or if the best individual estimate of $u_1 - u_0$ is zero. 

Selection bias arises when the treatment variable is correlated with the error in the 
outcome equation. This correlation could be induced by incorrectly omitted observable 
variables that partly determine D and y. Then the omitted variable component of the 
regression error will be correlated with D — the case of selection on observables. 
Another source comprises unobserved factors that partly determine both D and y. This 
is the case of selection on unobservables. The conditional independence assumption 
essentially rules out confounding caused by omitted variables. 


25.3.3. Selection on Observables 


In observational data the problem of selection on observables is solved using regres- 
sion and matching methods. Subsequent sections of this chapter present these methods 
in detail. Before doing so, we note that the two-part model of Section 16.4 is an exam- 
ple, and in this section we discuss a second straightforward method. 

The control function estimator is motivated by the possibility that a set of observ- 
able variables z that determine D may be correlated with the outcomes. For concrete- 
ness let us consider the special case where the outcome equation is 


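$$ y_i = \mathbf{x}_i'\boldsymbol{\beta} + \alpha D_i + u_i \tag{25.28} $$


and the error is such that 


$$ E[u_i \mid \mathbf{x}_i, D_i] = E[u_i \mid \mathbf{x}_i, D_i, \mathbf{z}_i]. $$


In the case of selection on observables we may have $E[u_i \mid \mathbf{z}_i] \neq 0$. Let us write 


$$ E[y_i \mid \mathbf{x}_i, D_i, \mathbf{z}_i] = \mathbf{x}_i'\boldsymbol{\beta} + \alpha D_i + E[u_i \mid \mathbf{x}_i, \mathbf{z}_i], \tag{25.29} $$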


which motivates the use of a control function estimator based on the OLS/GLS es- 
timation of the equation. The essential idea is to introduce into the outcome equation 
all observable variables that could possibly be correlated with u; and then estimate the 
resulting equation by least squares. Specifically, 


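$$ y_i = \mathbf{C}_i'\boldsymbol{\delta} + \alpha D_i + \{u_i - E[u_i \mid D_i, \mathbf{C}_i]\}, \tag{25.30} $$


where $\mathbf{C}_i$ includes all variables that are included in either $\mathbf{x}$ or $\mathbf{z}$. The presence of $\mathbf{z}$ in 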
the regression expunges the possible correlation between u and z. Note that if there is 
selection on unobservables, caused by common unobservable factors that affect both 
D and u, then we still have a potential identification problem. 

This estimator was used by Heckman and Hotz (1989), who also suggested a num- 
ber of variations on the basic control function estimators. 
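
The control function idea can be illustrated with a small artificial example in which an observable z both helps determine D and shifts the outcome. The sketch below is not the Heckman and Hotz procedure; it simply compares least squares with and without z, using a hypothetical noise-free data set and a hand-written OLS routine so that the block is self-contained.

```python
def ols(rows, y):
    """OLS coefficients from the normal equations (X'X) b = X'y, by Gauss-Jordan elimination."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


# Hypothetical data: z is correlated with D and enters the outcome with coefficient 0.5.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
z = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
d = [0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0]
alpha = 3.0   # assumed true treatment coefficient
y = [1.0 + 2.0 * xi + alpha * di + 0.5 * zi for xi, di, zi in zip(x, d, z)]

# Regression on (1, x, D) only: the omitted z loads onto the treatment coefficient.
short = ols([[1.0, xi, di] for xi, di in zip(x, d)], y)
# Regression on (1, x, z, D): z is included as a control alongside x.
full = ols([[1.0, xi, zi, di] for xi, zi, di in zip(x, z, d)], y)
alpha_short, alpha_full = short[2], full[3]
```

In this example the regression that omits z overstates the treatment coefficient, whereas including z recovers the assumed value. This works only because z is observed; it offers no protection against selection on unobservables.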


25.3.4. Selection on Unobservables 


We now consider a special linear case in which the treatment participation decision 
is endogenous. This is an example of a well-known class of models with an “en- 
dogenous dummy variable.” The model is empirically very important when working 
with observational data because in such cases there are several reasons for 
abandoning the restrictive assumption $y_0, y_1 \perp D \mid \mathbf{x}$ or $E[u \mid \mathbf{x}, D] = 0$. The breakdown 
of the conditional independence assumption implies that the simple least-squares 
regression cannot identify the ATE, and an alternative identification strategy should be 
pursued. 

The essential elements of the identification strategy we are about to discuss are 
common to other selection models. The approach involves fairly strong identifying as- 
sumptions and is fully parametric. In the special case considered, the specification is 
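analogous to the Roy model. The conditional means in the outcome equations are taken 
to be linear. The model is completed by adding a participation (binary) decision 
equation for $D_i$. Then 


\begin{align}
y_{1i} &= \mathbf{x}_i'\boldsymbol{\beta}_1 + u_{1i}, \tag{25.31} \\
y_{0i} &= \mathbf{x}_i'\boldsymbol{\beta}_0 + u_{0i}, \notag \\
D_i^* &= \mathbf{z}_i'\boldsymbol{\gamma} + \varepsilon_i, \notag
\end{align}


where $D_i^*$ is a latent variable such that 


$$ D_i = \begin{cases} 1 & \text{iff } D_i^* > 0, \\ 0 & \text{iff } D_i^* \leq 0, \end{cases} \tag{25.32} $$


and it is assumed that $E[u_1 \mid \mathbf{x}, \mathbf{z}] = E[u_0 \mid \mathbf{x}, \mathbf{z}] = 0$. 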

The variables $\mathbf{z}$ may overlap with $\mathbf{x}$, but it is assumed that at least one component of 
$\mathbf{z}$, denoted $z_1$, is unique and is a nontrivial determinant of $D$. That is, there is at least 
one independent source of variation in $D$. Hence we may refer to $z_1$ as an instrumental 
variable that is correlated with the endogenous variable $D$, but uncorrelated with the 
outcomes $y_1$ and $y_0$, except through $D$. 

Next it is assumed that the triple $(u_{1i}, u_{0i}, \varepsilon_i)$ is jointly multivariate normal 
distributed with zero mean and covariance matrix given by 


$$ \boldsymbol{\Sigma} = \begin{bmatrix} \sigma_{11} & \sigma_{10} & \sigma_{1\varepsilon} \\ \sigma_{10} & \sigma_{00} & \sigma_{0\varepsilon} \\ \sigma_{1\varepsilon} & \sigma_{0\varepsilon} & 1 \end{bmatrix}. \tag{25.33} $$


The nonzero covariance parameters $\sigma_{1\varepsilon}$ and $\sigma_{0\varepsilon}$ reflect the endogeneity of the 
treatment variable. The covariance parameter $\sigma_{10}$ reflects the covariance between the 
outcomes. Because we never observe any individual in both states, this parameter cannot 
be identified and is usually set to zero. The variance of $\varepsilon$ is restricted to 1 for 
identification. 

Given such a fully parametric specification, the model can be estimated by maxi- 
mum likelihood or by a two-step semiparametric procedure. Most of these issues have 
been discussed in Chapter 16. Leaving aside the estimation issue, we consider mea- 
sures of treatment impact. 

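The benefit of participation, or the ATET, is given by 


$$ y_{1i} - E[y_{0i} \mid D_i = 1] = y_{1i} - \mathbf{x}_i'\boldsymbol{\beta}_0 + \sigma_{0\varepsilon} \frac{\phi(\mathbf{z}_i'\boldsymbol{\gamma})}{\Phi(\mathbf{z}_i'\boldsymbol{\gamma})}, \tag{25.34} $$


which may also be written as 


$$ E[y_{1i} \mid D_i = 1] - E[y_{0i} \mid D_i = 1] = \mathbf{x}_i'(\boldsymbol{\beta}_1 - \boldsymbol{\beta}_0) + (\sigma_{0\varepsilon} - \sigma_{1\varepsilon}) \frac{\phi(\mathbf{z}_i'\boldsymbol{\gamma})}{\Phi(\mathbf{z}_i'\boldsymbol{\gamma})}, \tag{25.35} $$


where the term $(\sigma_{0\varepsilon} - \sigma_{1\varepsilon}) \phi(\mathbf{z}_i'\boldsymbol{\gamma}) / \Phi(\mathbf{z}_i'\boldsymbol{\gamma})$ denotes the selection effect; see Section 
16.7.1. 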
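
The ratio of the standard normal density to the standard normal distribution function that appears in the selection effect is simple to evaluate numerically. The sketch below computes it with the Python standard library and forms the selection term with the signs as written in (25.35); the function names and the covariance values in the usage lines are hypothetical.

```python
import math


def normal_pdf(c):
    """Standard normal density."""
    return math.exp(-0.5 * c * c) / math.sqrt(2.0 * math.pi)


def normal_cdf(c):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(c / math.sqrt(2.0)))


def selection_ratio(index):
    """phi(index) / Phi(index), where index plays the role of z'gamma.

    Not numerically safe for very negative index values, where Phi underflows to zero.
    """
    return normal_pdf(index) / normal_cdf(index)


def selection_effect(index, sigma_0e, sigma_1e):
    """Selection term of (25.35), using the sign convention of the text."""
    return (sigma_0e - sigma_1e) * selection_ratio(index)


ratio_at_zero = selection_ratio(0.0)                      # phi(0) / Phi(0) = sqrt(2 / pi)
no_selection = selection_effect(0.5, 0.3, 0.3)            # equal covariances: term vanishes
```

The ratio falls as the participation index rises, so the selection term matters most for individuals whose observables make participation unlikely.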

In the special case in which $\mathbf{x}_i'\boldsymbol{\beta}_0 = \mathbf{x}_i'\boldsymbol{\beta}_1$, and the treatment dummy enters the $y_1$ 
equation linearly with coefficient $\alpha$, the mean impact of the program is given by 


$$ E[y_i \mid D_i = 1] - E[y_i \mid D_i = 0] = \alpha + \text{selection term}. \tag{25.36} $$


In some sample situations this identification strategy may be somewhat fragile. 
For example, the treated and untreated groups may be too different, the multivariate 
normality assumption may be inappropriate, or the identifying instrumental variable 
$z_1$ may be weak or possibly correlated with the error in the outcome equations. 

These considerations motivate the use of alternative estimation methods presented 
in this chapter. These estimators generally presume selection on observables only, 
though Section 25.7 presents IV methods applicable when selection is additionally on 
unobservables. 


25.4. Matching and Propensity Score Estimators 


In observational studies, by definition there are no experimental controls. Therefore, 
there is no direct counterpart of the ATE calculated as a mean difference between the 
outcomes of the treated and nontreated groups. In other words, the counterfactual is 
not identified. As a substitute we may obtain data from a set of potential comparison 
units that are not necessarily drawn from the same population as the treated units, but 
for whom the observable characteristics, x, match those of the treated units up to some 
selected degree of closeness. 

The average outcome for the untreated matched group identifies the mean counter- 
factual outcome for the treated group in the absence of the treatment. This approach 
solves the evaluation problem by assuming that selection is unrelated to the untreated 
outcome, conditional on x. To make the approach operational it is necessary to define 
the matching criteria. 


25.4.1. Treatment Effect Assumptions 


Matching estimators of treatment effects are useful when selection into treatment is on 
observables only. In addition it is assumed the overlap (or support) condition (25.6) 
applies, which means that for every x there is a positive probability of nonparticipation. 
This ensures that we have untreated matches for the treated observations for every 
x. Roughly speaking, the control and treated populations have comparable observed 
characteristics. Generating good matches means ensuring that the support condition 
does not fail. Further, the key assumption is that unobservable variables play no role 
in the treatment assignment and outcome determination. 

The regression estimator imputes the missing potential outcome using the estimated 
regression function. If $D_i = 1$, $y_{0,i}$ is imputed using the estimated conditional 
regression function $\widehat{\mu}_0(\mathbf{x}_i)$. Matching estimators impute the missing value using the 
outcomes of the “nearest neighbors”; the latter are defined by a suitable metric based on 
some observable characteristics. This is the basis of the analogy between a matching 
estimator and nonparametric methods based on the number of nearest neighbors, typi- 
cally just one. The matching estimator typically approximates the difference between 
the means, and the variance of the estimator is estimated using many of the available 
results on variance of differences between the means. 

Matching is a persuasive and attractive methodology if (1) we can control for a 
rich set of x variables, (2) there are many potential controls, and (3) ATET is the 
parameter of interest. It also requires the “no general equilibrium effects” assumption, 
or stable unit treatment value assumption (SUTVA), which implies that treatment 
does not indirectly affect untreated observations. The matching estimator avoids the 
assumption that the treatment effect enters the conditional mean function linearly. The 
initial step of establishing the nearest matches for each observation will also clarify 
whether comparable control observations are available. Unlike the regression approach 
there is less danger of extrapolation into regions outside the range of the data. 

Suppose the treated cases are matched in terms of all observable covariates. In a 
restricted sense all differences between the treated and untreated groups are controlled. 
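Given the outcomes $y_{1i}$ and $y_{0i}$ for the treatment and control, respectively, the 
difference in average outcomes between the two groups is 


\begin{align}
& E[y_{1i} \mid D_i = 1] - E[y_{0i} \mid D_i = 0] \tag{25.37} \\
&\quad = E[y_{1i} - y_{0i} \mid D_i = 1] + \{E[y_{0i} \mid D_i = 1] - E[y_{0i} \mid D_i = 0]\}. \notag
\end{align}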


The first term in the second line is the ATET, and the second term in braces is a “bias” 
term, which will be zero if the assignment to the treatment and control is random. In 
this case all that is necessary to estimate the ATET is a simple average of the differen- 
tial due to treatment. 
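
The decomposition of the simple difference in group means into the ATET and a bias term can be checked numerically when both potential outcomes are known, as in a simulation. The sketch below does this for five hypothetical individuals; the function name and the data are illustrative assumptions, and the calculation is infeasible with real data because the untreated outcome of the treated is never observed.

```python
from statistics import mean


def decompose_mean_difference(y1, y0, d):
    """Return (naive difference, ATET, bias) for the decomposition of the mean difference.

    Requires both potential outcomes, so it is only feasible in a simulation.
    """
    treated = [i for i, di in enumerate(d) if di == 1]
    control = [i for i, di in enumerate(d) if di == 0]
    naive = mean([y1[i] for i in treated]) - mean([y0[i] for i in control])
    atet = mean([y1[i] - y0[i] for i in treated])
    bias = mean([y0[i] for i in treated]) - mean([y0[i] for i in control])
    return naive, atet, bias


# Hypothetical potential outcomes: the treated have lower untreated outcomes on average.
y0 = [10.0, 12.0, 9.0, 11.0, 8.0]
y1 = [13.0, 12.5, 12.0, 11.0, 12.0]
d = [1, 0, 1, 0, 1]
naive, atet, bias = decompose_mean_difference(y1, y0, d)
```

Here the bias term is negative, so the naive comparison of group means understates the gain to the treated. Under random assignment the bias term would be zero in expectation.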

More realistically the data will involve some observed covariates $x_i$. It is assumed 
that the covariates include the determinants of selection into the 
treatment group. If treated and nontreated groups are matched on each combination 
of covariates, then the treatment differential can be easily calculated for each treated 
case and each $x_i$. The average of the differential over all treated individuals and all $x_i$ 
measures the average treatment effect. Formally, in this case (see Angrist and Krueger, 
2000, p. 1316) the effect of the treatment on the treated is given by 

\[
\begin{aligned}
E[y_{1i} - y_{0i} \mid D_i = 1]
&= E\{E[y_{1i} \mid x_i, D_i = 1] - E[y_{0i} \mid x_i, D_i = 0] \mid D_i = 1\} \\
&= E[\Delta_x \mid D_i = 1],
\end{aligned}
\tag{25.38}
\]

where $\Delta_x = E[y_{1i} \mid x_i, D_i = 1] - E[y_{0i} \mid x_i, D_i = 0]$. 


If the x variables are discrete, then the matching estimator is defined as a weighted 
sum 

\[
E[y_{1i} - y_{0i} \mid D_i = 1] = \sum_x \Delta_x \Pr[x_i = x \mid D_i = 1],
\tag{25.39}
\]

where $\Pr[x_i = x \mid D_i = 1]$ is the probability mass for $x_i$, given $D_i = 1$. Angrist and 
Krueger (2000) discuss several aspects of this estimator. 
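
The weighted sum in (25.39) can be illustrated with a minimal sketch. The function name, the toy data, and the choice to skip cells that lack either treated or control units are assumptions made for illustration; they are not taken from Angrist and Krueger (2000).

```python
from collections import defaultdict


def exact_matching_atet(x, d, y):
    """ATET by exact matching on a discrete covariate, in the spirit of (25.39).

    x: hashable cell labels, d: 0/1 treatment indicators, y: observed outcomes.
    Cells lacking either treated or control units are skipped (no match exists).
    """
    treated = defaultdict(list)
    control = defaultdict(list)
    for xi, di, yi in zip(x, d, y):
        (treated if di == 1 else control)[xi].append(yi)

    # Keep only cells where both groups are present.
    cells = [c for c in treated if c in control]
    n_treated = sum(len(treated[c]) for c in cells)

    atet = 0.0
    for c in cells:
        # Delta_x: treated mean minus control mean within the cell.
        delta = sum(treated[c]) / len(treated[c]) - sum(control[c]) / len(control[c])
        # Weight: share of matched treated units in the cell, Pr[x_i = x | D_i = 1].
        atet += delta * len(treated[c]) / n_treated
    return atet


# Hypothetical data: two cells, "a" and "b".
x = ["a", "a", "a", "b", "b", "b", "b"]
d = [1, 0, 0, 1, 1, 1, 0]
y = [5.0, 2.0, 4.0, 6.0, 8.0, 10.0, 5.0]
print(exact_matching_atet(x, d, y))  # 2.75
```

In this toy sample the cell differentials are 2 and 3, and the treated units are distributed one quarter and three quarters across the two cells, which gives 0.25 x 2 + 0.75 x 3 = 2.75.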


25.4.2. Exact Matching 


The procedure is to match treated and untreated individuals on their observable char- 
acteristics x. 

Exact matching is practicable when the vector of covariates is discrete and the 
sample contains many observations at each distinct value of $x_i$. 

If the covariate vector has a high dimension, or if continuous variations among some 
covariates are present, then exact matching between treated and nontreated groups 
becomes impractical. This problem motivates inexact matching methods. Inexact 
matching works by mapping x into a lower dimensional measure, continuous or 
discrete, usually a scalar f(x) that forms the basis of matching. 


25.4.3. Propensity Scores 


The method of propensity scores (Rosenbaum and Rubin, 1983) is a popular inexact 
matching method. Rather than match on the regressors it matches on the propensity 
score. Even here an exact match is not possible, so the comparison units are those 
whose propensity scores are sufficiently close to the treated unit. 

The propensity score, the conditional probability of receiving treatment given x, 
denoted p(x), was suggested by Rosenbaum and Rubin (1983) as a matching measure. 
As noted in Section 25.2.5, if the data justify matching on x, then matching based on 
propensity score is also justified. 

The propensity score is usually estimated using a parametric model such as a logit 
or probit but can in principle be estimated using nonparametric methods. 


Matching Using Propensity Scores 


In the method of propensity scores one controls for the covariates by controlling for 
a particular function of the covariates, specifically the conditional probability of 
treatment, $\Pr[D_i = 1 \mid x_i]$. That is, matching is on the propensity score. This can be 
easily calculated by (for example) a logit regression. Moreover, one can also control for 
lagged variables by including them in the vector of covariates. If selection bias is 
eliminated by controlling for $x_i$, it is also eliminated by controlling for the propensity score. 
Conditioning on the propensity score is often simpler than conditioning on a large 
dimensional vector x. Dehejia and Wahba (1998) provide an empirical illustration based 
on the data previously used by Lalonde (1986). 
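
As a rough illustration of the logit step, the following sketch fits a logit propensity score for a single scalar covariate by Newton-Raphson. The function names, the restriction to one regressor, the fixed iteration count, and the toy data are all assumptions made to keep the example short; in practice one would use an established estimation routine and a richer specification.

```python
import math


def fit_logit(x, d, iterations=25):
    """Fit Pr[D=1|x] = 1/(1+exp(-(a + b*x))) by Newton-Raphson for a scalar x."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for xi, di in zip(x, d):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = p * (1.0 - p)
            g_a += di - p            # score with respect to a
            g_b += (di - p) * xi     # score with respect to b
            h_aa += w                # information matrix entries
            h_ab += w * xi
            h_bb += w * xi * xi
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: (a, b) += H^{-1} g, using the closed-form 2x2 inverse.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


def propensity(x_value, a, b):
    """Fitted propensity score at a given covariate value."""
    return 1.0 / (1.0 + math.exp(-(a + b * x_value)))


# Hypothetical data: a binary covariate, so the fitted scores
# reproduce the treated share within each covariate cell.
x = [0, 0, 0, 0, 1, 1, 1, 1]
d = [1, 0, 0, 0, 1, 1, 1, 0]
a, b = fit_logit(x, d)
print(propensity(0, a, b), propensity(1, a, b))
```

With a binary covariate the model is saturated, so the fitted scores should be close to the cell shares of treated units (0.25 and 0.75 here). The fitted scores, not the coefficients, are what the matching step uses.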


Implementation Issues 


Propensity score methods call for a good model to generate the scores. Our interest 
is in estimating consistently the participation probability rather than the estimates of 
parameters in the propensity score function. A better statistical fit for the propensity 
score is more likely to result from a flexible parametric or nonparametric model. 

In implementing matching based on $p(x_i)$ three issues are relevant: (1) whether to 
match with or without replacement, (2) the number of units to use in the comparison 
set, and (3) the choice of the matching method. 

Matching without replacement means that any observation in the comparison group 
is matched to no more than one treated observation, that which is the closest match, 
whereas with replacement means that there can be multiple matches. When matching is 
without replacement and the comparison set is small, the matches may 
not be very close in terms of p(x), which will increase the bias of the estimator. 

The issue of choosing the number of cases in the comparison set involves a trade-off 
between bias and variance. Using a single closest match to a treated case 
keeps the bias low. Including more matched controls reduces the variance, 
but the bias increases if the additional observations are inferior matches for the 
treated observations. A partial solution is to use a predefined neighborhood in terms 
of a radius around the p(x) of the treated observation and to exclude matches that lie 
outside this neighborhood. In other words, one only uses the better matches. This is 
called “caliper matching.” 
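
A minimal sketch of single nearest-neighbor matching with a caliper follows. It assumes that propensity scores have already been estimated, that matching is with replacement, and that treated cases without a control inside the caliper are simply dropped; the function name and the data are hypothetical.

```python
def caliper_nn_atet(p_treated, y_treated, p_control, y_control, caliper):
    """Single nearest-neighbor matching on the propensity score, with replacement.

    Treated units whose closest control lies farther than `caliper` are dropped.
    Returns (ATET estimate, number of treated units actually matched).
    """
    differences = []
    for p_i, y_i in zip(p_treated, y_treated):
        # Index of the control whose score is closest to p_i.
        j = min(range(len(p_control)), key=lambda k: abs(p_control[k] - p_i))
        if abs(p_control[j] - p_i) <= caliper:
            differences.append(y_i - y_control[j])
    if not differences:
        raise ValueError("no treated unit has a control within the caliper")
    return sum(differences) / len(differences), len(differences)


# Hypothetical scores and outcomes.
estimate, n_matched = caliper_nn_atet(
    p_treated=[0.30, 0.60, 0.90], y_treated=[10.0, 12.0, 20.0],
    p_control=[0.28, 0.55, 0.58], y_control=[8.0, 9.0, 11.0],
    caliper=0.05,
)
print(estimate, n_matched)  # 1.5 2
```

The third treated case has no control within 0.05 of its score and is excluded, which illustrates how a caliper trades a smaller usable treated sample for closer matches.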

Heckman et al. (1997, 1998) study the performance of matching estimators using 
experimental data from the Job Training Partnership Act (JTPA) combined with sam- 
ples of comparison groups from three sources. Data quality plays a key role in robust 
estimation of treatment effects by matching methods. The results are best when the 
data sources and definitions are comparable for treated and nontreated groups, when 
the treated and nontreated come from the same labor market, and when the propensity 
score can be modeled using a rich set of regressors. 

The issue of the sensitivity of the results to the chosen method is not amenable to 
a simple direct answer. The outcome may vary across different samples, depending 
on the extent of overlap between the treated and untreated observations. If the two 
groups are similar in the sense that there is a substantial overlap in their propensity 
scores, and if the comparison group is large, then the matches will be easier to find 
and matching with replacement will be feasible. If the comparison group is small and 
disparate from the treated group, then one may run out of satisfactory matches and be 
unable to use the full treated sample, this being especially likely if matching is without 
replacement. 

The application of Dehejia and Wahba (2002) to the National Supported 
Work Program data provides an instructive illustration. We examine and illus- 
trate the issues of implementation in Section 25.8 using the Dehejia and Wahba 
data set. 


25.4.4. Measuring Treatment Effects 


Denote the comparison group for the treated case i with characteristics $x_i$ as the set 
$A_j(x) = \{ j \mid x_j \in c(x_i) \}$, where $c(x_i)$ is the characteristics neighborhood of $x_i$. Let $N_C$ 
denote the number of cases in the comparison group and let $w(i, j)$ denote the weight 
given to the jth case in making a comparison with the ith treated case, with $\sum_j w(i, j) = 1$. 
Then a general formula for the matching ATET estimator is 

\[
\Delta^M = \frac{1}{N_T} \sum_{i \in \{D = 1\}} \Big[ y_{1,i} - \sum_j w(i, j)\, y_{0,j} \Big],
\tag{25.40}
\]

where $0 < w(i, j) \le 1$, $\{D = 1\}$ is the set of treated individuals, and j is an 
element of the set of matched comparison units. Different matching estimators are 
generated by varying the choice of $w(i, j)$. 


Matching Methods 


Simple matching compares cells with exactly the same discrete x, 

\[
\Delta^M = \sum_k w_k \left[ \bar{y}_{1k} - \bar{y}_{0k} \right],
\tag{25.41}
\]

where $\bar{y}_{1k}$ is the mean outcome of the treated and $\bar{y}_{0k}$ is the mean outcome of the 
untreated in cell k, and $w_k$ is the weight of the kth cell (i.e., the fraction of observations in cell k). 
A specific example (Dehejia and Wahba, 2002) is 

\[
\Delta^M = \frac{1}{N_T} \sum_{i \in \{D = 1\}} \Big[ y_{1,i} - \frac{1}{N_{C,i}} \sum_{j \in \{D = 0\}} y_{0,j} \Big],
\tag{25.42}
\]

where $N_T$ is the number in the treated group (D = 1) and $N_{C,i}$ is the number in the 
comparison group corresponding to the ith observation. 

The nearest-neighbor matching method chooses, for every treated individual i, the 
set $A_j(x) = \{ j \mid \min_j \|x_i - x_j\| \}$, where $\|\cdot\|$ denotes the Euclidean distance between 
vectors. If $w(i, j) = 1$ in (25.40) when $j \in A_j(x)$, and zero otherwise, then this 
specification uses only one case to construct the comparison group for each treated case. 

Another estimator is generated by kernel matching in which 

\[
w(i, j) = \frac{K(x_j - x_i)}{\sum_{j=1}^{N_{C,i}} K(x_j - x_i)},
\]

where K is a kernel discussed in Section 9.3. 
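
The kernel weights can be sketched as follows for a scalar x. The Gaussian kernel and the explicit bandwidth argument are assumptions introduced for illustration (the formula above leaves any scaling implicit in K), and the function names are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kernel_weights(x_i, x_controls, bandwidth):
    """Weights w(i, j) for one treated case; they sum to one by construction."""
    k = [gaussian_kernel((x_j - x_i) / bandwidth) for x_j in x_controls]
    total = sum(k)
    return [k_j / total for k_j in k]


def kernel_matching_atet(x_treated, y_treated, x_controls, y_controls, bandwidth):
    """Matching estimator of the form (25.40) with kernel weights."""
    total = 0.0
    for x_i, y_i in zip(x_treated, y_treated):
        w = kernel_weights(x_i, x_controls, bandwidth)
        # Weighted average of control outcomes imputes the missing y_0 for case i.
        counterfactual = sum(w_j * y_j for w_j, y_j in zip(w, y_controls))
        total += y_i - counterfactual
    return total / len(x_treated)


# Hypothetical data: one treated case flanked symmetrically by two controls.
print(kernel_weights(0.0, [-1.0, 1.0], bandwidth=1.0))                       # [0.5, 0.5]
print(kernel_matching_atet([0.0], [5.0], [-1.0, 1.0], [2.0, 4.0], 1.0))      # 2.0
```

Unlike nearest-neighbor matching, every control receives a positive weight here, with closer controls weighted more heavily; a smaller bandwidth concentrates the weight on the nearest cases.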

These methods share the advantage that they avoid functional form assumptions for 
the outcome equations in estimating ATET and can estimate it at specific values of x. 
They have the disadvantage that if x is high dimensional then the number of matches 
can become very small. In such cases matching based on a scalar-valued metric has 
attractions. Propensity score matching, discussed previously, is such a method. 

Nearest-neighbor and kernel matching can be defined in terms of propensity scores 
also. For example, for nearest-neighbor matching we can define the matching set as 
$A_i(p(x)) = \{ p_j \mid \min_j \| p_i - p_j \| \}$. 

Stratification or interval matching is based on the idea of dividing the range of 
variation of the propensity score in intervals such that within each interval the treated 
and control units have, on the average, the same propensity score. One can use the 
same blocks identified by the algorithm used for computing the propensity scores. 
Then we compute the difference between the average outcomes of the treated and 
the control groups. ATET is the weighted average of these differences, with weights 
being determined by the distribution of the treated units across the blocks. One of the 
disadvantages of this method is that it discards observations in blocks in which either 
treated or control units are absent. 

Denote by b the blocks defined over intervals of propensity score. Then the 
treatment effect within the bth block is defined as 

\[
\text{ATET}^S_b = (N^T_b)^{-1} \sum_{i \in I(b)} y_{1i} - (N^C_b)^{-1} \sum_{j \in I(b)} y_{0j},
\]

where I(b) is the set of units in block b, $N^T_b$ is the number of treated units in the bth 
block, and $N^C_b$ is the number of control units in the bth block. Then the treatment effect 
based on stratification is defined as 

\[
\text{ATET}^S = \sum_{b=1}^{B} \text{ATET}^S_b \times \left[ \frac{\sum_{i \in I(b)} D_i}{\sum_{\forall i} D_i} \right],
\tag{25.43}
\]

where the term in brackets is the weight for each block given by the corresponding 
fraction of treated units and where B is the total number of blocks. 
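
The stratification estimator can be sketched as below. The block boundaries are supplied by the user, and blocks without treated or without control units are discarded, with the weights renormalized over the remaining treated units; that renormalization, the function name, and the data are assumptions of this sketch rather than part of (25.43).

```python
def stratified_atet(p, d, y, cutpoints):
    """Stratification (interval matching) estimator in the spirit of (25.43).

    p: estimated propensity scores, d: 0/1 treatment, y: observed outcomes.
    cutpoints: increasing interior boundaries that split the score range into blocks.
    """
    n_blocks = len(cutpoints) + 1
    treated = [[] for _ in range(n_blocks)]
    control = [[] for _ in range(n_blocks)]
    for p_i, d_i, y_i in zip(p, d, y):
        b = sum(1 for c in cutpoints if p_i >= c)   # block index of unit i
        (treated if d_i == 1 else control)[b].append(y_i)

    # Blocks with no treated or no control units are discarded.
    usable = [b for b in range(n_blocks) if treated[b] and control[b]]
    n_treated = sum(len(treated[b]) for b in usable)

    atet = 0.0
    for b in usable:
        block_effect = (sum(treated[b]) / len(treated[b])
                        - sum(control[b]) / len(control[b]))
        # Weight: fraction of (retained) treated units falling in block b.
        atet += block_effect * len(treated[b]) / n_treated
    return atet


# Hypothetical data with two blocks split at a score of 0.5.
p = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.9]
d = [1, 0, 0, 1, 1, 1, 0]
y = [5.0, 2.0, 4.0, 6.0, 8.0, 10.0, 5.0]
print(stratified_atet(p, d, y, cutpoints=[0.5]))  # 2.75
```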

In radius matching the set $A_i(p(x)) = \{ p_j \mid \| p_i - p_j \| < r \}$ is based on 
propensity scores. This means that all control cases with estimated propensity scores falling 
within radius r are matched to the ith treated case. 

We can express ATE and ATET in terms of p(x), assuming the overlap condition 
$0 < p(x) < 1$. The two key results are 

\[
\text{ATE} = E\left[ \frac{(D - p(x))\, y}{p(x)(1 - p(x))} \right],
\tag{25.44}
\]

\[
\text{ATET} = E\left[ \frac{(D - p(x))\, y}{\Pr[D = 1]\,(1 - p(x))} \right];
\tag{25.45}
\]

the last result is due to Dehejia (1997). 
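The derivations of these results are as follows: 

\[
\begin{aligned}
y &= (1 - D) y_0 + D y_1 \\
  &= y_0 + D(y_1 - y_0), \\
(D - p(x)) y &= (D - p(x))(y_0 + D(y_1 - y_0)) \\
  &= D y_1 - p(x) y_0 - D p(x) y_1 + D p(x) y_0 \\
  &= D y_1 - p(x)(1 - D) y_0 - D p(x) y_1.
\end{aligned}
\tag{25.46}
\]

Next, taking expectations conditional on x and noting that $E[D \mid x] = p(x)$ we get 

\[
\begin{aligned}
E[(D - p(x)) y \mid x]
  &= p(x) E[y_1 \mid x] - p(x)(1 - p(x)) E[y_0 \mid x] - p(x)^2 E[y_1 \mid x] \\
  &= p(x)(1 - p(x)) E[y_1 \mid x] - p(x)(1 - p(x)) E[y_0 \mid x] \\
  &= p(x)(1 - p(x)) E[y_1 - y_0 \mid x],
\end{aligned}
\tag{25.47}
\]

whence it follows that 

\[
\text{ATE} = E[y_1 - y_0] = E\left[ \frac{(D - p(x))\, y}{p(x)(1 - p(x))} \right].
\]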
To derive the Dehejia result, we have 

\[
\begin{aligned}
E\left[ \frac{(D - p(x))\, y}{1 - p(x)} \right]
  &= E[p(x)(\mu_1(x) - \mu_0(x))] \\
  &= E[D(y_1 - y_0)] \\
  &= E[D(y_1 - y_0) \mid D = 1] \Pr[D = 1],
\end{aligned}
\tag{25.48}
\]

where the first line follows from (25.47), the second line is implied by the conditional 
independence assumption, and the last line expresses joint expectation as a product of 
marginal and conditional expectations, which implies 

\[
\text{ATET} = \frac{E[D(y_1 - y_0)]}{\Pr[D = 1]}.
\]
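Using (25.44) and (25.45), consistent estimators, based on a sample of size N, are 

\[
\widehat{\text{ATE}} = \frac{1}{N} \sum_{i=1}^{N} \frac{(D_i - \hat{p}(x_i))\, y_i}{\hat{p}(x_i)(1 - \hat{p}(x_i))},
\tag{25.49}
\]

\[
\widehat{\text{ATET}} = \left( \frac{1}{N} \sum_{i=1}^{N} D_i \right)^{-1}
\frac{1}{N} \sum_{i=1}^{N} \frac{(D_i - \hat{p}(x_i))\, y_i}{1 - \hat{p}(x_i)},
\tag{25.50}
\]

where $\left( N^{-1} \sum_{i=1}^{N} D_i \right)$ is a consistent estimator of $\Pr[D = 1]$. 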
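
The sample analogues (25.49) and (25.50) are simple to compute once fitted propensity scores are available. The sketch below takes the fitted scores as given; the function names and the toy data, in which every unit has a score of one half, are hypothetical.

```python
def ipw_ate(d, y, p_hat):
    """Sample analogue of (25.44): mean of (D - p) y / (p (1 - p))."""
    if any(not 0.0 < p_i < 1.0 for p_i in p_hat):
        raise ValueError("overlap condition 0 < p(x) < 1 is violated")
    n = len(y)
    return sum((d_i - p_i) * y_i / (p_i * (1.0 - p_i))
               for d_i, y_i, p_i in zip(d, y, p_hat)) / n


def ipw_atet(d, y, p_hat):
    """Sample analogue of (25.45): mean of (D - p) y / (1 - p), over the mean of D."""
    if any(not p_i < 1.0 for p_i in p_hat):
        raise ValueError("propensity scores must be below one")
    n = len(y)
    share_treated = sum(d) / n          # consistent estimate of Pr[D = 1]
    numerator = sum((d_i - p_i) * y_i / (1.0 - p_i)
                    for d_i, y_i, p_i in zip(d, y, p_hat)) / n
    return numerator / share_treated


# Hypothetical data: treatment assigned with probability one half for everyone.
d = [1, 1, 0, 0]
y = [3.0, 5.0, 1.0, 3.0]
p_hat = [0.5, 0.5, 0.5, 0.5]
print(ipw_ate(d, y, p_hat), ipw_atet(d, y, p_hat))  # 2.0 2.0
```

With a constant score both estimators reduce to the difference between the treated and control sample means (4 minus 2 here), which is a convenient sanity check.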


25.4.5. Variance of ATET Based on x and p(x) 


Under identifiability assumptions given in Section 25.2, $\Delta_x$ and $\Delta_{p(x)}$ are defined as 

\[
\begin{aligned}
\Delta_x &= \frac{1}{N_T} \sum_{i \in \{D = 1\}} \left[ y_{1i} - E[y_0 \mid D = 0, x = x_i] \right] \\
         &= \frac{1}{N_T} \sum_{i \in \{D = 1\}} \Big[ y_{1i} - \sum_{j \in A_j(x)} w_{ij}\, y_{0,j} \Big]
\end{aligned}
\]

and 

\[
\begin{aligned}
\Delta_{p(x)} &= \frac{1}{N_T} \sum_{i \in \{D = 1\}} \left[ y_{1i} - E[y_0 \mid D = 0, p(x) = p(x_i)] \right] \\
              &= \frac{1}{N_T} \sum_{i \in \{D = 1\}} \Big[ y_{1i} - \sum_{j \in A_j(p(x))} w_{ij}\, y_{0,j} \Big],
\end{aligned}
\]

where i is the subscript for the treated group, $w_{ij} = 1/N_{C,i}$, and $N_{C,i}$ is the number of 
cases in the comparison group for the ith treated case. Both are consistent estimators of 
ATET, $E[y_1 - y_0 \mid D = 1, x]$, the first based on x, and the second on p(x). A practical 
issue is whether adjusting for differences by propensity score is better in terms of 
efficiency than adjusting for differences using x. Hahn (1998), Heckman et al. (1998), 
and others have shown that there is no unambiguous ranking of the two estimators in 
terms of their asymptotic variance, even if we assume that $p(x_i)$ is known, which in 
practice will not be the case in observational studies. 
Write the asymptotic variances for the two cases as follows: 

\[
\begin{aligned}
V[\Delta_x] &= E[V[y_1 \mid D = 1, x] \mid D = 1] + V[E[y_1 - y_0 \mid D = 1, x] \mid D = 1], \\
V[\Delta_{p(x)}] &= E[V[y_1 \mid D = 1, p(x)] \mid D = 1] + V[E[y_1 - y_0 \mid D = 1, p(x)] \mid D = 1],
\end{aligned}
\]

where we use the variance decomposition result given in Section A.8. In general x is a 
better predictor than p(x), which implies that 

\[
\begin{aligned}
E[V[y_1 \mid D = 1, x] \mid D = 1] &\le E[V[y_1 \mid D = 1, p(x)] \mid D = 1], \\
V[E[y_1 - y_0 \mid D = 1, x] \mid D = 1] &\ge V[E[y_1 - y_0 \mid D = 1, p(x)] \mid D = 1],
\end{aligned}
\]

because conditioning on x loses less information than conditioning on p(x), which is 
a particular function of x. Thus the second comparison favors the propensity score 
method whereas the first comparison favors the use of x over p(x). 
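
For reference, the decomposition invoked here is presumably the standard law of total variance; the exact statement in Section A.8 is not reproduced, so the form below is an assumption about what that section contains. For a random variable y and conditioning variables x,

```latex
V[y] = E\big[ V[y \mid x] \big] + V\big[ E[y \mid x] \big].
```

Because p(x) is a function of x, replacing x by p(x) as the conditioning variable cannot lower the first term or raise the second, which is the source of the two opposing inequalities.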

A helpful practical guide and computer programs for implementing the calculations 
of ATET are provided by Becker and Ichino (2002). 


25.5. Differences-in-Differences Estimators 


Chapters 2 and 3 discussed the setting of a natural experiment or a quasi-experiment 
in which a treatment variable undergoes a change that can be viewed as an exogenous 
variation in a treatment variable. The treated group can be compared to an untreated 
comparison group. 

In some cases one has data on the treated and the comparison (control) groups 
both before and after the experiment. Then for the ith treated case the change in 
the outcome is measured by $[y_{ia} - y_{ib} \mid D_{ia} = 1]$ and that for the untreated group 
is measured by $[y_{ia} - y_{ib} \mid D_{ia} = 0]$. Then the differences-in-differences measure 
$[y_{ia} - y_{ib} \mid D_{ia} = 1] - [y_{ia} - y_{ib} \mid D_{ia} = 0]$, where subscripts a and b denote “after” 
and “before” the experiment occurs, forms the basis of an estimate of the treatment 
effect. This method has been introduced in Sections 3.4.2 and 22.6. 

Consider a model with a fixed effect $\phi_i$ and a drift term $\delta_t$, where the pre-treatment 
and post-treatment outcomes are given by, respectively, 

\[
y_{it,0} = \phi_i + \delta_t + \varepsilon_{it},
\tag{25.51}
\]

\[
y_{it,1} = y_{it,0} + \alpha,
\tag{25.52}
\]

so that 

\[
\begin{aligned}
y_{it} &= (1 - D_{it})\, y_{it,0} + D_{it}\, y_{it,1} \\
       &= \phi_i + \delta_t + \alpha D_{it} + \varepsilon_{it}.
\end{aligned}
\tag{25.53}
\]

The preceding equations are for t = a, b; (25.51) is for the group that did not get 
treated and (25.52) is for the group that did get treated. Using the “before” and “after” 
formulation, we obtain the treatment effect 

\[
\begin{aligned}
\alpha &= E[y_{ia} - y_{ib} \mid D_{ia} = 1] - E[y_{ia} - y_{ib} \mid D_{ia} = 0] \\
       &= \{E[y_{ia} \mid D_{ia} = 1] - E[y_{ia} \mid D_{ia} = 0]\} \\
       &\quad - \{E[y_{ib} \mid D_{ia} = 1] - E[y_{ib} \mid D_{ia} = 0]\},
\end{aligned}
\tag{25.54}
\]

where the differencing step eliminates the fixed effect $\phi_i$ and the drift $\delta_t$. 
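
The sample counterpart of (25.54) replaces the expectations by group means of the before-after changes. The following is a minimal sketch with hypothetical, noise-free data generated from (25.53) with fixed effects 1, 2, 3, 4, a common drift of 0.5 in the after period, and a treatment effect of 2; the function names are illustrative only.

```python
def mean(values):
    return sum(values) / len(values)


def did_estimate(y_before, y_after, treated):
    """Differences-in-differences estimate of alpha in (25.54).

    y_before, y_after: outcomes for the same units before and after;
    treated: 0/1 indicator D_ia of treatment in the "after" period.
    """
    change_treated = [a - b for a, b, t in zip(y_after, y_before, treated) if t == 1]
    change_control = [a - b for a, b, t in zip(y_after, y_before, treated) if t == 0]
    return mean(change_treated) - mean(change_control)


# Hypothetical panel: phi_i = 1, 2, 3, 4; drift 0.5; alpha = 2; no noise.
y_before = [1.0, 2.0, 3.0, 4.0]
y_after = [3.5, 4.5, 3.5, 4.5]
treated = [1, 1, 0, 0]
print(did_estimate(y_before, y_after, treated))  # 2.0
```

The treated and control groups have different fixed effects, yet the estimate recovers the treatment effect because the first difference removes the fixed effects and the second difference removes the common drift.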
There are alternatives to taking differences. One alternative is to control directly for 
pretreatment outcome difference between treatment and control groups by regression. 
For example, replace $\phi_i$ in (25.51) by $x_i'\beta + \gamma y_{ib}$ to obtain 

\[
\begin{aligned}
y_{ia,0} &= x_i'\beta + \gamma y_{ib} + \delta_a + \varepsilon_{ia}, \\
y_{ia,1} &= x_i'\beta + \gamma y_{ib} + \delta_a + \alpha D_{ia} + \varepsilon_{ia}.
\end{aligned}
\tag{25.55}
\]

Estimates of $\alpha$ are constructed by regressing posttreatment outcomes on a constant, 
pretreatment outcomes, $x_i$, and $D_{ia}$. The interpretation of $\alpha$ as a causal parameter relies 
on the assumption that after controlling for $x_i$ and $y_{ib}$, the treatment effect completely 
accounts for the posttreatment difference between the treated and control groups. The 
fixed effect is given a linear functional form, whereas a matching strategy can be based 
on weaker assumptions. 

Our previous results could actually be based on quasi-experimental data. For ex- 
ample, compare people in one state with one law with those in a different state with 
a different law, and use control functions for the state effects. The new element is the 
addition of data before the experiment. By the assumption that the two states have the 
same drift term, we can use the differences-in-differences method to eliminate the state 
effects for which otherwise we would need control functions. 


25.6. Regression Discontinuity Design 


Identification of the treatment effect can sometimes be facilitated by either a natu- 
ral experiment or using data generated in a quasi-experimental setting. Regression- 
discontinuity (RD) design is an example of a quasi-experimental design in which the 
probability of receiving a treatment is a discontinuous function of one or more underly- 
ing variables. Such a design can arise in circumstances where a treatment is triggered 
by an administrative or organizational rule. For example, Angrist and Lavy (1999) 
study the effect of class size on student test scores, taking advantage of the data gener- 
ated under the operation of “Maimonides Rule,” which stipulates that the class be split 
when it reaches a specific threshold size. Van der Klaauw (2003) estimates the effect of 
financial aid offers on the student’s decision to attend a college, exploiting the identi- 
fying information provided by a discontinuity in the administrative rule that relates the 
aid to the student’s SAT score and the grade point average. These econometric appli- 
cations are predated by Thistlethwaite and Campbell (1960), who analyzed the impact 
of student scholarships on career aspirations, exploiting the fact that the awards are 
made only when the student’s test score exceeds a threshold; see also Trochim (1984). 
The treatment here follows Van der Klaauw (2003). 


25.6.1. Discontinuous Treatment Assignment Mechanism 


In the case of an RD design, there is additional information about the selection rule: 
It is known that the treatment assignment mechanism depends (at least in part) on the 
value of an observed continuous variable relative to a given threshold, or cutoff score, 
in such a way that the corresponding probability of getting treated (propensity score) 
is a discontinuous function of this variable at the cutoff score. Figure 25.1 illustrates a 
sample generated by the RD design. 

Figure 25.1: Regression-discontinuity design: example. 

In the simplest RD design, called the sharp RD design, individuals are assigned to 
treatment and control groups solely on the basis of an observed continuous measure S, 
called the selection or assignment variable. Those falling below the distinct cutoff $\bar{S}$ 
do not receive treatment and constitute the control group whereas those that are above 
the cutoff receive treatment (D = 1). That is, the treatment assignment occurs through 
a known and measured deterministic decision rule: $D_i = 1[S_i \ge \bar{S}]$. In Figure 25.2 the 
sharp RD design is shown as a solid line (see Van der Klaauw, 2003). 

In the sharp RD design 

\[
E[u \mid D, S] = E[u \mid S],
\tag{25.56}
\]

where u denotes the error in the outcome equation. Because S is the only systematic 
determinant of D, S will capture any correlation between D and u. 

With $D_i = D(S_i) = 1[S_i \ge \bar{S}]$, a dependence between $D_i$ and $u_i$ would make OLS 
an inconsistent estimator of $\alpha$. As previously mentioned, one approach to estimating 
the treatment effect in such a case is to specify and to include the conditional mean 
function $E[u \mid D, S]$ as a “control function” in the outcome equation. Thus 

\[
y_i = \beta + \alpha D_i + k(S_i) + \varepsilon_i,
\tag{25.57}
\]

where $\varepsilon_i = y_i - E[y_i \mid D_i, S_i]$. If k(S) is correctly specified, the regression will 
consistently estimate $\alpha$. 

If k(S) is linear then $\alpha$ will be estimated by the distance between the two linear 
parallel regression lines at the cutoff point, which in this case equals the difference 
between the two intercepts. It is an unbiased estimate of the common treatment effect 
if the control function is linear. 
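
The linear control function case can be sketched as an OLS regression of y on a constant, the treatment dummy, and S. The small least-squares helper, the function names, and the noise-free data are assumptions made so that the example is self-contained; the control function is taken to be linear, which is exactly the condition stated above.

```python
def ols(rows, y):
    """Least squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * pv for v, pv in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def sharp_rd(s, y, cutoff):
    """Estimate alpha in y = beta + alpha*D + k(S) + e, D = 1[S >= cutoff], k linear."""
    rows = [[1.0, 1.0 if s_i >= cutoff else 0.0, s_i - cutoff] for s_i in s]
    beta, alpha, slope = ols(rows, y)
    return alpha


# Hypothetical noise-free data: beta = 1, alpha = 2, k(S) = 0.5 S, cutoff at 0.
s = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
y = [1.0 + 2.0 * (s_i >= 0.0) + 0.5 * s_i for s_i in s]
print(sharp_rd(s, y, cutoff=0.0))
```

With these data the estimate should be numerically very close to 2, the vertical distance between the two parallel lines at the cutoff.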

In the more general case of varying treatment effects, the coefficient 
of D represents $E[\alpha_i \mid \bar{S}]$, a local average treatment effect of the kind discussed in 
Section 25.7.1, and k(S) is a specification of 
$E[u \mid S] + (E[\alpha_i \mid S] - E[\alpha_i \mid \bar{S}])\, 1[S \ge \bar{S}]$, where $1[S \ge \bar{S}] = 1$ if 
the condition in brackets is satisfied. Incorrect specification of k(S) leads to 
inconsistency, and hence a semiparametric specification may be tried, for example, 
$k(S) = \sum_{j=1}^{J} \eta_j S^j$, where J may be determined by a suitable method. 

The variable S may be related to the outcome y, which would automatically cause 
(y, D) to be related even when there is no causal link between the two variables. This 
contrasts with random assignment that avoids such dependence. 

Whereas random assignment makes treatment and control groups similar in 
respects other than the receipt of treatment, the sharp RD design makes them 
different, at least in terms of their S value. This violates the “strong ignorability” 
assumption of Rosenbaum and Rubin (1983), which also requires the overlap condition, 
$0 < \Pr[D = 1 \mid S] < 1$, whereas in the sharp RD design model $\Pr[D = 1 \mid S] \in \{0, 1\}$. 


25.6.2. Identification and Estimation under RD Design 


The main intuition is that the sample of individuals in the small neighborhood of the 
cutoff will be similar to a randomized experiment at the cutoff point because they 
have essentially the same S value. Those just below the cutoff are expected to be very 
similar to those just above it. A comparison of the average y value of those just above 
and those just below the cutoff will produce an estimate of the average treatment effect. 
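
A minimal sketch of this local comparison follows. The symmetric window, the function name, and the noise-free data (the same outcome equation as a linear control function with slope 0.5 and a treatment effect of 2) are assumptions for illustration.

```python
def local_mean_difference(s, y, cutoff, bandwidth):
    """Mean of y just above the cutoff minus mean of y just below it."""
    above = [y_i for s_i, y_i in zip(s, y) if cutoff <= s_i < cutoff + bandwidth]
    below = [y_i for s_i, y_i in zip(s, y) if cutoff - bandwidth <= s_i < cutoff]
    if not above or not below:
        raise ValueError("no observations on one side of the cutoff within the bandwidth")
    return sum(above) / len(above) - sum(below) / len(below)


# Hypothetical data: y = 1 + 2 D + 0.5 S with D = 1[S >= 0].
s = [-0.9, -0.5, -0.1, 0.1, 0.5, 0.9]
y = [1.0 + 2.0 * (s_i >= 0.0) + 0.5 * s_i for s_i in s]
narrow = local_mean_difference(s, y, cutoff=0.0, bandwidth=0.2)
wide = local_mean_difference(s, y, cutoff=0.0, bandwidth=1.0)
print(narrow, wide)
```

Here the narrow window gives roughly 2.1 and the wide window roughly 2.5, against a true effect of 2: because the outcome depends on S, widening the interval pulls in observations whose S values differ more, and the comparison of means drifts away from the treatment effect.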

Increasing the interval around the cutoff will bias the estimate of the treatment ef- 
fect, especially if the assignment variable was itself related to the outcome variable 
conditional on treatment status. If an assumption about the functional form of this 
relationship can be made then one can “use more observations and extrapolate from 
above and below the cutoff point to what a tie-breaking randomized experiment would 
have shown. This double extrapolation, combined with exploitation of the ‘random- 
ized experiment’ around the cutoff point, has been the main idea behind regression- 
discontinuity analysis” (Van der Klaauw, 2003, p. 1258). 

Observe that in this RD design, 

\[
\lim_{S \downarrow \bar{S}} E[y \mid S] - \lim_{S \uparrow \bar{S}} E[y \mid S]
= \alpha + \lim_{S \downarrow \bar{S}} E[u \mid S] - \lim_{S \uparrow \bar{S}} E[u \mid S].
\tag{25.58}
\]

A more formal way of assuming that, in the absence of treatment, individuals 
in a small interval around $\bar{S}$ would have similar average outcomes is to specify the 
following: 


Assumption A1. The conditional mean function $E[u \mid S]$ is continuous at $\bar{S}$. 


Assumption A2. The mean treatment effect function $E[\alpha_i \mid S]$ is right continuous at $\bar{S}$: 

\[
y_i = \beta + \alpha D_i + k(S_i) + \varepsilon_i,
\tag{25.59}
\]

where $\varepsilon_i = y_i - E[y_i \mid D_i, S_i]$. 
Then the result in (25.58) follows. 


25.6.3. Fuzzy RD Design 


Here the treatment assignment depends on the selection variable in a stochastic 
manner. The propensity score $\Pr[D = 1 \mid S]$ is known to have a 
discontinuity at $\bar{S}$. A possible consequence of misassignment relative to the cutoff value is 
a fuzzy design, with values of S near the cutoff point appearing both in the treatment 
and control groups. Alternatively, the assignment may be based on additional variables 
observed by the treatment administrator but unobserved by the program evaluator. So 
relative to the sharp RD design, the fuzzy RD design selection depends on both 
observables and nonobservables. In Figure 25.2 the fuzzy RD design is shown as a dashed 
line. 

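We can still exploit the discontinuity in the selection rule to identify the 
treatment effect under assumption A1. If $E[u \mid S]$ is continuous at $\bar{S}$, then 
$\lim_{S \downarrow \bar{S}} E[y \mid S] - \lim_{S \uparrow \bar{S}} E[y \mid S] = \alpha \left[ \lim_{S \downarrow \bar{S}} E[D \mid S] - \lim_{S \uparrow \bar{S}} E[D \mid S] \right]$. 
Therefore, the treatment effect $\alpha$ is identified by 

\[
\alpha = \frac{\lim_{S \downarrow \bar{S}} E[y \mid S] - \lim_{S \uparrow \bar{S}} E[y \mid S]}
              {\lim_{S \downarrow \bar{S}} E[D \mid S] - \lim_{S \uparrow \bar{S}} E[D \mid S]},
\tag{25.60}
\]

where the denominator $\lim_{S \downarrow \bar{S}} E[D \mid S] - \lim_{S \uparrow \bar{S}} E[D \mid S] \neq 0$ because of the known 
discontinuity of $E[D \mid S]$ at $\bar{S}$. 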
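
A crude sample analogue of (25.60) replaces the one-sided limits by means within a small window on each side of the cutoff. The window width, the function name, and the data are hypothetical, and the sketch ignores any dependence of the outcome on S inside the window.

```python
def fuzzy_rd_ratio(s, d, y, cutoff, bandwidth):
    """Ratio of the jump in mean y to the jump in mean D across the cutoff, as in (25.60)."""
    def side_means(keep):
        ys = [y_i for s_i, y_i in zip(s, y) if keep(s_i)]
        ds = [d_i for s_i, d_i in zip(s, d) if keep(s_i)]
        if not ys:
            raise ValueError("no observations on one side of the cutoff")
        return sum(ys) / len(ys), sum(ds) / len(ds)

    y_hi, d_hi = side_means(lambda v: cutoff <= v < cutoff + bandwidth)
    y_lo, d_lo = side_means(lambda v: cutoff - bandwidth <= v < cutoff)
    if d_hi == d_lo:
        raise ValueError("no discontinuity in the treatment probability detected")
    return (y_hi - y_lo) / (d_hi - d_lo)


# Hypothetical data near the cutoff: y = 1 + 3 D, treated share 0.25 below, 0.75 above.
s = [-0.1, -0.1, -0.1, -0.1, 0.1, 0.1, 0.1, 0.1]
d = [0, 0, 0, 1, 1, 1, 1, 0]
y = [1.0 + 3.0 * d_i for d_i in d]
print(fuzzy_rd_ratio(s, d, y, cutoff=0.0, bandwidth=0.5))  # 3.0
```

The jump in mean outcome is 1.5 and the jump in the treated share is 0.5, so the ratio returns the treatment effect of 3 even though assignment is not deterministic.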
In the case of heterogeneous treatment responses we need additional assumptions. 


Assumption A2*. The average treatment effect function $E[\alpha_i \mid S]$ is continuous at $\bar{S}$. 


Assumption A3. $D_i$ is independent of $\alpha_i$ conditional on S near $\bar{S}$: 

\[
y_i = \beta + \alpha E[D_i \mid S_i] + k(S_i) + \varepsilon_i,
\tag{25.61}
\]

where $\varepsilon_i = y_i - E[y_i \mid D_i, S_i]$ and $k(S_i)$ is a specification of $E[u_i \mid S_i]$. 


25.6.4. A Two-Stage Estimator 


If $\mathrm{Cov}[D, u] \neq 0$, OLS regression will produce a biased estimate of $\alpha$. However, the 
following can lead to a consistent estimator. Consider 

\[
y_i = \beta + \alpha E[D_i \mid S_i] + k(S_i) + \varepsilon_i,
\tag{25.62}
\]

where $\varepsilon_i = y_i - E[y_i \mid S_i]$ and $k(S_i)$ is a specification of $E[u_i \mid S_i]$. 


Stage 1: Specify propensity score function for a fuzzy RD design as 

\[
E[D_i \mid S_i] = f(S_i) + \gamma\, 1[S_i \ge \bar{S}],
\tag{25.63}
\]

where $f(S_i)$ is some function of S that is continuous at $\bar{S}$. By specifying 
the functional form of f (or by estimating f semi- or nonparametrically) we can 
estimate $\gamma$, the discontinuity in the propensity score function at $\bar{S}$. 


Stage 2: The control function-augmented outcome equation is then estimated with $D_i$ 
replaced by the first-stage estimate of $E[D_i \mid S_i] = \Pr[D_i = 1 \mid S_i]$; this estimate will 
be discontinuous in S whereas the included control function for k(S) would be 
continuous in S at $\bar{S}$. Under correct specification of $f(S_i)$ and $k(S_i)$ the two-stage 
procedure is consistent. 

Figure 25.2: Regression Discontinuity Design; treatment assignment in sharp (solid) and 
fuzzy (dashed) designs. 
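
The two stages can be sketched as two least-squares regressions. The sketch assumes that both f(S) and k(S) are linear in S, uses a small self-contained least-squares helper, and works with hypothetical noise-free data; none of these choices come from Van der Klaauw (2003).

```python
def ols(rows, y):
    """Least squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * pv for v, pv in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def two_stage_rd(s, d, y, cutoff):
    """Two-stage fuzzy RD estimate; returns (alpha, gamma)."""
    z = [1.0 if s_i >= cutoff else 0.0 for s_i in s]
    # Stage 1: E[D|S] = f(S) + gamma * 1[S >= cutoff], with f linear (an assumption).
    g0, g1, gamma = ols([[1.0, s_i, z_i] for s_i, z_i in zip(s, z)], d)
    d_hat = [g0 + g1 * s_i + gamma * z_i for s_i, z_i in zip(s, z)]
    # Stage 2: replace D by its first-stage fit; k(S) is also taken to be linear.
    beta, alpha, kappa = ols([[1.0, dh, s_i] for dh, s_i in zip(d_hat, s)], y)
    return alpha, gamma


# Hypothetical data: treatment is more likely above the cutoff but not certain.
s = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
d = [0, 0, 1, 0, 1, 1, 0, 1]
y = [1.0 + 2.0 * d_i + 0.5 * s_i for d_i, s_i in zip(d, s)]
print(two_stage_rd(s, d, y, cutoff=0.0))
```

In this example the first stage finds a jump of about 0.5 in the treatment probability, and the second stage returns a treatment effect very close to 2. The procedure is numerically the same as two-stage least squares with the cutoff indicator as the excluded instrument.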


25.7. Instrumental Variable Methods 


In recent years instrumental variable methods have been strongly advocated as an al- 
ternative to MLE and other strongly parametric methods (Angrist, Imbens, and Rubin, 
1996). Such an identification strategy is attractive in models with selection on un- 
observables (see Section 25.3.4). In many applications such a model consists of a 
linear equation for a continuous outcome variable whose conditional mean and vari- 
ance structure is specified, without any additional distributional assumptions. A lead- 
ing case has a continuous outcome dependent upon a vector of regressors x and a single 
endogenous treatment (dummy) variable (D) that represents the decision to participate 
in the treatment. This equation is called the participation or selection equation. In a 
more general setting, one may have a limited dependent or discrete outcome and there 
may be multiple treatment variables. 

The discussion that follows overlaps with the coverage of IV estimation in several 
places in this book and with that of selection models. The IV approach allows us to 
develop another “local” variant of the ATE parameter. 


25.7.1. Local ATE (LATE) 


We reconsider the simple linear formulation. The outcome equation is a linear function 
of observable variables x and a participation indicator D: 

\[
y_i = x_i'\beta + \alpha D_i + u_i,
\tag{25.64}
\]

and the participation decision depends on a single variable z, referred to as an 
instrument, 

\[
D_i^* = \gamma_0 + \gamma_1 z_i + v_i,
\tag{25.65}
\]

where $D^*$ is a latent variable with its observable counterpart $D_i$ generated by 

\[
D_i = \begin{cases} 1 & \text{if } D_i^* > 0, \\ 0 & \text{if } D_i^* \le 0. \end{cases}
\tag{25.66}
\]
There are two assumptions: 


1. There is a variable z that appears in the equation for D that does not appear in the 
equation for y. It may be continuous or discrete, and in a special case it is binary. The 
exclusion of regressors x from the participation equation is a simplification. The simul- 
taneous presence of z in the participation equation and its exclusion from the outcome 
equation is referred to as the exclusion restriction. This model structure is familiar from 
Chapter 16 on selection models. 


2. $\mathrm{Cov}[z, v] = \mathrm{Cov}[u, z] = \mathrm{Cov}[x, u] = 0$, and 
$\mathrm{Cov}[D, z] \neq 0$. 


Together with the first assumption, this assumption implies, as previously emphasized, 
that y depends on z only through D, and D depends on z in a nontrivial fashion. Hence 
we use the notation D (z) to emphasize the dependence of D on z. 


Under these assumptions IV estimation of (25.64) yields consistent estimates of 
$(\beta, \alpha)$. Let $z' = z + \delta$, $\delta \neq 0$. Then noting that $E[D \mid x, D(z)] = \Pr[D(z) = 1]$ and 
taking expectations we obtain 

\[
\begin{aligned}
E[y \mid x, D(z)] &= x'\beta + \alpha \Pr[D(z) = 1], \\
E[y \mid x, D(z')] &= x'\beta + \alpha \Pr[D(z') = 1],
\end{aligned}
\]

where, after subtraction, we have 

\[
E[y \mid x, z'] - E[y \mid x, z] = \alpha \left[ \Pr[D(z') = 1] - \Pr[D(z) = 1] \right].
\]

Solving the equation for $\alpha$ yields the expression for the local average treatment 
effect (LATE), analyzed by Imbens and Angrist (1994): 

\[
\begin{aligned}
\alpha_{\text{LATE}}
&= \frac{E[y \mid x, z'] - E[y \mid x, z]}{\Pr[D(z') = 1] - \Pr[D(z) = 1]} \\
&= \frac{\int_{R(x)} \left[ E[y \mid x, z'] - E[y \mid x, z] \right] dF(x \mid x \in R(x))}
        {\int_{R(x)} \left( \Pr[D(z') = 1] - \Pr[D(z) = 1] \right) dF(x \mid x \in R(x))} \\
&= \frac{E[y \mid z'] - E[y \mid z]}{\Pr[D(z') = 1] - \Pr[D(z) = 1]},
\end{aligned}
\tag{25.67}
\]

where the second line involves averaging over x, whose support is denoted by R(x). 
This expression is well defined if $\Pr[D(z') = 1] - \Pr[D(z) = 1] \neq 0$. The sample 
analogue of this expression is the ratio of the mean difference between the treated and 
the nontreated divided by the change in the proportion treated owing to the change in z. 
This estimator is an IV estimator. Using the results on the asymptotic normality of the 
IV estimator, we can obtain confidence intervals for the LATE parameter. 

The qualifier “local” in LATE is justified because it measures the treatment effect on 
the “compliers” that are induced to participate in the treatment as a result of the change 
in z. LATE depends on the particular values of z used to evaluate the treatment and on 
the particular instrument chosen. The group of “movers” may not be representative of 
the whole treated population, let alone the whole population. Consequently, the LATE 
parameter may not be informative about the consequences of large policy changes 
brought about by changes in instruments different from those historically observed. 

For a binary instrument the LATE and the IV estimates are equivalent, as shown in 
Angrist et al. (1996, p. 447). If more than one instrument appears in the participation 
equation, as when there exist overidentifying restrictions, the LATE parameter esti- 
mated for each instrument will in general differ. However, a weighted average may be 
constructed. 

The foregoing analysis applies when the treatment effect does not vary with 
individuals. If, however, the treatment effect is heterogeneous, then there is a potential for 
confounding the variation induced by z: Is the observed variation due to z-differences 
or $\alpha$-differences? Under heterogeneity the idiosyncratic component of the treatment 
effect, 

\[
u_{ia} = u_{i0} + D_i (\alpha_i(x_i) - \alpha(x_i)),
\]

is a function of $\alpha_i(x_i) - \alpha(x_i)$, see (25.27). Then the previous assumptions are not 
enough to determine ATE or ATET. A solution to this difficulty is the addition of the 
monotonicity assumption as an additional identifying condition. Essentially this says 
that the instrument affects participation in a monotonic fashion, so that if on average 
participation is more likely given Z = w than given Z = z, then anyone who would 
participate given Z = z must also participate given Z = w. 


25.7.2. Relation to Other Measures 


The IV estimator of $\alpha$ is the same as what we would estimate by using a two-stage 
least-squares procedure in which we first estimate the probability of receiving 
treatment, $\Pr[D = 1 \mid x, z]$, and then run a regression of the outcome y on x and the fitted 
probability, assuming of course that the treatment effect is additive. Consider a special 
case of the IV estimator in which x is a scalar and equals one, and z is a scalar dummy 
variable that denotes eligibility to participate in the treatment; z = 1 implies eligibility 
and z = 0 implies noneligibility. 

We can partition the population into four categories: compliers (C), always-takers 
(A), never-takers (N), and defiers (D). Compliers are induced to receive treatment by 
being eligible, always-takers will receive treatment whether or not they are eligible, 
never-takers refuse treatment regardless of eligibility, and defiers are contrarians who 
refuse treatment if eligible and take treatment if not. Assume that there are no defiers, 
so there are just three categories. 

The Wald estimator of the treatment effect is defined by 

\[
\mathrm{TE}_{\mathrm{WALD}} = \frac{E[y_i \mid z_i = 1] - E[y_i \mid z_i = 0]}{E[D_i \mid z_i = 1] - E[D_i \mid z_i = 0]}, \tag{25.68}
\]

whose numerator, expressed as a weighted average of treatment effects on the three 
categories, with weights equal to the probability of being in each category, is 


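\[
\begin{aligned}
&\Pr[C]\{E[y_i \mid z_i = 1, C] - E[y_i \mid z_i = 0, C]\} \\
&\quad + \Pr[A]\{E[y_i \mid z_i = 1, A] - E[y_i \mid z_i = 0, A]\} \\
&\quad + \Pr[N]\{E[y_i \mid z_i = 1, N] - E[y_i \mid z_i = 0, N]\} \\
&= \Pr[C]\{E[y_i \mid z_i = 1, C] - E[y_i \mid z_i = 0, C]\}.
\end{aligned}
\]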


The result in the final line follows because the terms corresponding to always-takers 
and never-takers are identically zero. The denominator in (25.68) is the probability of 
compliance, Pr[C]. Therefore, 


\[
\mathrm{TE}_{\mathrm{WALD}} = E[y_{1i} \mid z_i = 1, C] - E[y_{0i} \mid z_i = 0, C]. \tag{25.69}
\]


If we compare \(\mathrm{TE}_{\mathrm{WALD}}\) with the LATE measure, we find that LATE is a measure of 
the effect of treatment on the subgroup of those at the margin of participating, denoted 
as compliers. 
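
The argument leading to (25.69) can be checked numerically. The following is a minimal simulation sketch rather than part of the original analysis: the population shares, the type-specific treatment effects, and the function name `simulate_wald` are illustrative assumptions. It computes the sample analogue of (25.68) with a randomly assigned binary eligibility indicator and no defiers.

```python
import random

# Illustrative assumptions: population shares and treatment effects by type.
SHARES = {"C": 0.5, "A": 0.2, "N": 0.3}    # compliers, always-takers, never-takers
EFFECTS = {"C": 1.0, "A": 4.0, "N": 0.0}   # y1 - y0 for each type


def simulate_wald(n=200_000, seed=0):
    """Return (Wald estimate, estimated first stage) from simulated data."""
    rng = random.Random(seed)
    sum_y = [0.0, 0.0]   # running sums of y for z = 0 and z = 1
    sum_d = [0.0, 0.0]   # running sums of D for z = 0 and z = 1
    count = [0, 0]
    for _ in range(n):
        z = rng.randint(0, 1)              # randomly assigned eligibility
        u = rng.random()
        if u < SHARES["C"]:
            kind = "C"
        elif u < SHARES["C"] + SHARES["A"]:
            kind = "A"
        else:
            kind = "N"
        # No defiers: always-takers take treatment, compliers only if eligible.
        d = 1 if kind == "A" or (kind == "C" and z == 1) else 0
        y = rng.gauss(0.0, 1.0) + EFFECTS[kind] * d
        sum_y[z] += y
        sum_d[z] += d
        count[z] += 1
    numerator = sum_y[1] / count[1] - sum_y[0] / count[0]
    first_stage = sum_d[1] / count[1] - sum_d[0] / count[0]
    return numerator / first_stage, first_stage


if __name__ == "__main__":
    wald, first_stage = simulate_wald()
    print(f"Wald estimate: {wald:.3f} (assumed complier effect: {EFFECTS['C']})")
    print(f"First stage:   {first_stage:.3f} (assumed complier share: {SHARES['C']})")
```

Under these assumed values the population-average effect is 0.5(1.0) + 0.2(4.0) + 0.3(0.0) = 1.3, whereas the Wald ratio should settle near the complier effect of 1.0, which illustrates why LATE need not equal ATE when effects differ by type.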

In empirical economic applications the concept of a marginal impact caused by 
variation in a continuous variable, measured by a partial derivative, is well entrenched 
and is replaced by a discrete analogue when the variation in the causal variables is dis- 
crete. Thus a marginal treatment effect (MTE) measure conditional on x is defined 
as 

\[
\mathrm{MTE} = \left.\frac{\partial E[y \mid x, Z]}{\partial \Pr[D = 1 \mid x, Z]}\right|_{Z = z}. \tag{25.70}
\]

Heckman and Vytlacil (2002) show that ATE, ATET, and LATE are all averages 
of MTE taken over different subsets of the Z support, or subpopulations. ATE is the 
expected value of MTE over the full support of z, including where participation rate is 
zero or one. ATET excludes the support of z where participation does not occur. LATE 
is the average of MTE over an interval of z where participation rates differ. 


25.7.3. IV Estimation in a Model with Heterogeneous Treatment Effect 


We now consider a model that allows for selection on unobservables and heterogeneous 
treatment effects. The context is a linear model with an endogenous treatment 
variable whose coefficient is random; see Björklund and Moffitt (1987). Such a 
model, which can be motivated by the consideration that the treatment effect is not con- 
stant across the treated, has been considered by Wooldridge (1997) and Heckman and 
Vytlacil (1998). 

We write the model as a simultaneous equations model with the outcome variable 
\(y_1\) that depends upon treatment variable \(y_2\). For simplicity the treatment variable \(y_2\) is 
taken to be continuous. Given instrument z and exogenous variables \(x_i\), the model is as 
follows: 

\[
\begin{aligned}
y_{1i} &= (\alpha + v_i) y_{2i} + x_i'\beta_1 + \varepsilon_i \\
&= \alpha y_{2i} + x_i'\beta_1 + \varepsilon_i + v_i y_{2i} \\
&= v_i \bar{y}_2 + \alpha y_{2i} + x_i'\beta_1 + w_i,
\end{aligned}
\tag{25.71}
\]

\[
y_{2i} = \gamma z_i + x_i'\beta_2 + \eta_i, \tag{25.72}
\]

where \(w_i = \varepsilon_i + v_i(y_{2i} - \bar{y}_2)\). The marginal response of \(y_1\) with respect to a change 
in \(y_2\) is \((\alpha + v_i)\), which varies across individuals, thus permitting a heterogeneous 
treatment effect. 

Suppose \(E[\varepsilon_i \mid x_i, y_{2i}] = E[v_i \mid x_i, y_{2i}] = 0\). Then \(E[\varepsilon_i + v_i y_{2i} \mid x_i, y_{2i}] = 0\), and 
\(V[\varepsilon_i + v_i y_{2i} \mid x_i, y_{2i}]\) depends on \(x_i\) and hence is heteroskedastic. Then the least-squares 
estimator of \((\alpha, \beta_1)\) is consistent but not efficient. This follows from the 
assumed exogeneity of \(y_2\). 

We next consider the case where the treatment variable is endogenous. The follow- 
ing assumptions are made: 


\[
E[\varepsilon_i \mid x_i, z_i] = E[\eta_i \mid x_i, z_i] = E[v_i \mid x_i, z_i] = 0, \tag{25.73}
\]

\[
E[\varepsilon_i^2 \mid x_i, z_i] = \sigma_\varepsilon^2; \quad E[v_i^2 \mid x_i, z_i] = \sigma_v^2; \quad E[\eta_i^2 \mid x_i, z_i] = \sigma_\eta^2. \tag{25.74}
\]

Endogeneity is introduced by permitting correlation between v and η. Specifically, 
assume that \(E[v_i \mid \eta_i] = \rho\eta_i\), which would hold if (v, η) were bivariate normally 
distributed. Under these assumptions, z is a valid instrument, and x is exogenous. The 
exclusion of z from the \(y_1\) equation is an identifying restriction. Therefore instrumental 
variable estimation of (25.71) with instruments (z, x) is a natural estimator. Note, 
however, that the condition for consistent estimation is \(E[w_i \mid x_i, z_i] = 0\). The first 
component of \(w_i\), \(\varepsilon_i\), is uncorrelated with \(z_i\) by assumption; the second component of \(w_i\) is 
\(v_i(y_{2i} - \bar{y}_2)\), which may at first sight seem to be correlated with \(z_i\), on which \(y_{2i}\) 
depends. If so, the IV estimator would be inconsistent. However, it can be shown that 
the IV estimator is consistent under the preceding assumptions. The key step in 
the argument involves showing that \(E[v_i y_{2i} \mid z_i] = E[v_i y_{2i}]\), a result established in 
Wooldridge (1997) by applying the law of iterated expectations; thus, 

\[
\begin{aligned}
E[v y_2 \mid z] &= E[E[v y_2 \mid z, \eta] \mid z] \\
&= E[y_2 E[v \mid z, \eta] \mid z] = E[\rho \eta y_2 \mid z] \\
&= \rho E[\eta^2 \mid z] = \rho \sigma_\eta^2 = E[v y_2].
\end{aligned}
\tag{25.75}
\]


Although the IV estimator is consistent under the assumptions given here, it is not 
efficient because of the heteroskedastic error. Hence heteroskedastic-consistent stan- 
dard errors should be used. Finally, we have not tackled the issue of sensitivity of esti- 
mated treatment effects to the choice of instruments when the response to treatment is 
heterogeneous. 
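
The consistency claim can be illustrated with a small Monte Carlo sketch. Everything specific here is an assumption made for illustration: x contains only an intercept, the instrument has mean 2, the parameter values are α = 1, γ = 1, ρ = 0.8, and the function names are hypothetical. The code simulates (25.71) and (25.72) with \(E[v \mid \eta] = \rho\eta\) and compares the least-squares slope with the simple IV slope cov(z, y1)/cov(z, y2).

```python
import random


def covariance(a, b):
    """Sample covariance of two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / n


def simulate_random_coefficient(n=200_000, alpha=1.0, gamma=1.0, rho=0.8, seed=1):
    """Simulate (25.71)-(25.72) with an intercept only in x; return (OLS, IV) slopes."""
    rng = random.Random(seed)
    z, y1, y2 = [], [], []
    for _ in range(n):
        z_i = rng.gauss(2.0, 1.0)                 # instrument, independent of all errors
        eta_i = rng.gauss(0.0, 1.0)               # error in the treatment equation
        v_i = rho * eta_i + rng.gauss(0.0, 0.5)   # random coefficient with E[v | eta] = rho * eta
        eps_i = rng.gauss(0.0, 1.0)               # outcome error
        y2_i = gamma * z_i + eta_i
        y1_i = (alpha + v_i) * y2_i + eps_i
        z.append(z_i)
        y1.append(y1_i)
        y2.append(y2_i)
    ols_slope = covariance(y2, y1) / covariance(y2, y2)
    iv_slope = covariance(z, y1) / covariance(z, y2)
    return ols_slope, iv_slope


if __name__ == "__main__":
    ols_slope, iv_slope = simulate_random_coefficient()
    print(f"OLS slope: {ols_slope:.3f}")
    print(f"IV slope:  {iv_slope:.3f} (alpha assumed to be 1.0)")
```

In this design the IV slope should be close to α, because \(E[v y_2 \mid z]\) is constant and is absorbed by the intercept, while the least-squares slope is pulled away from α by the correlation between v and η.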
25.7.4. Endogenous Treatment in Nonlinear Models 


Consider how the analyses of Sections 25.3 and 25.7 change if the outcome of a job 
training program were employment rather than earnings, or was duration to job place- 
ment. Alternatively, suppose that posttraining a significant proportion remains unem- 
ployed and has zero earnings, so that the sample is a mixture of those with zero and 
positive earnings and hence will be nonnormal. How should one extend the previous 
methods to handle the complications of nonlinearity and nonnormality? 

The specification and estimation of nonlinear, nonnormal models of treatment and 
outcome with selection is an issue that occurs frequently in microeconometrics. As in 
linear models, a major focus in such models is on the effect of an endogenous treat- 
ment variable on an economic outcome. The model specification comprises an out- 
come equation with a structural—causal interpretation and other equations that model 
the generating process of treatment variables. There are two broad approaches to this 
problem, a parametric one that relies on likelihood-based (including Bayesian) meth- 
ods and a semiparametric one that relies on GMM or linearized IV methods. 

The typical setup is illustrated by the following selected examples. In labor eco- 
nomics, Bingley and Walker (2001) examine the effect of duration of husbands’ un- 
employment on wives’ discrete labor supply choices. Here the treatment variable is 
nonnegative and possibly censored or truncated. Pitt and Rosenzweig (1990) study the 
effect of endogenous health status of infant children on their mothers’ main daily ac- 
tivity; here the treatment variable is discrete and the outcome is continuous. Carrasco 
(2001) examines the effect of childbirth on labor force participation of women. In 
treatment—outcome models related to fertility, Jensen (1999) examines the effect of 
contraceptive use, a discrete variable, on duration between births, a limited dependent 
variable. Olsen and Farkas (1989) examine the effect of childbirth on the hazard of 
dropping out of school. In health economics, Kenkel and Terza (2001) examine the 
effect of physician advice (discrete) on the consumption of alcohol (continuous and 
nonnegative). Gowrisankaran and Town (1999) study the effect of hospital choice on 
the hazard of death in a hospital. In health economics the impact of health insurance 
choice on health care utilization, sometimes measured as an expenditure variable and 
sometimes as a count of number of units of some specific type of service such as doctor 
visits or hospital admissions, is frequently studied using the framework of a two-part 
model (Deb and Trivedi 1997). Terza (1998) and van Ophem (2000) model the effect 
of household vehicle ownership on counts of trips. Many other examples can be cited. 

These models share many statistical features. First, both treatment and outcome pro- 
cesses are nonnormal and nonlinear: multinomial, count, discrete, or censored. Second, 
in each model the treatment is endogenous. Finally, investigators often have good a pri- 
ori reasons for choosing particular parametric marginal models for both treatments and 
outcomes. However, the transition from given marginal distributions to a joint model 
for treatment and outcome is an essential step that is potentially problematic when 
nonnormal multivariate distributions are involved. Often the marginal models have no 
(or very restrictive) tractable multivariate counterparts (e.g., in models of counts and 
durations). In others, treatment and outcome are from different statistical families (e.g., 
treatment being a multinomial and the outcome being a hazard rate) and so analytically 
tractable multivariate distributions often do not exist. Because of the specialized nature 
of applications in this area, this topic is not pursued any further here. 


25.8. Example: The Effect of Training on Earnings 


The National Supported Work (NSW) demonstration project, conducted in the 1970s, 
measured the impact of training on earnings by a randomized experiment that assigned 
some individuals to receive training (a treatment group) and others to receive no train- 
ing (a control group). The effect of training could then be measured by direct compar- 
ison of sample means of posttreatment earnings for the treatment and control groups. 

As was discussed in Chapter 3, randomized experiments are relatively rare in the 
social sciences. More often an observational sample is used with some individuals 
observed to receive a treatment while others do not. Comparison of the treated with the 
nontreated must then control for differences in observed characteristics, and possibly 
in unobserved characteristics. 

To determine the adequacy of standard microeconometric methods for observational 
data, Lalonde (1986) contrasted outcomes for the NSW treated group with those for 
control groups drawn from two national surveys. He obtained results that differed sub- 
stantially from the experimental results that contrasted the NSW treated and control 
groups, and he concluded that the observational methods were unreliable. 

Dehejia and Wahba (1999, 2002) reanalyzed a subset of the Lalonde data using al- 
ternative matching methods, which they argued led to conclusions from observational 
data that were considerably closer to those from experimental data. In this section we 
use their data from Dehejia and Wahba (1999) to illustrate the application of methods 
introduced in Sections 25.2 to 25.5 that control only for selection on observables. 


25.8.1. Dehejia and Wahba Data 


The treated sample is one of 185 males who received training during 1976-1977. The 
control group consists of 2,490 male household heads under the age of 55 who are 
not retired, drawn from the PSID. Dehejia and Wahba (1999) call these two samples 
the RE74 subsample (of the NSW treated) and the PSID-1 sample (of nontreated). 
The treatment indicator variable D is defined as D = 1 if training is received (so the 
observation is in the treated sample) and D = 0 if no training was received (and the 
observation is in the control sample). 

Summary statistics for key variables are given in Table 25.3. The treated group 
differs considerably from the control group, being disproportionately black (84%) with 
less than a high school degree (71%) and unemployed in the pre-treatment year 1975 
(71%). Estimates of the effect of training should control for these differences. 


25.8.2. Control Function Approach 


Various estimates of the effect of training on earnings are given in Table 25.4. 
The outcome of interest is posttreatment earnings, RE78. One possible measure of 
the effect of training is the mean difference in RE78 between NSW treated and PSID 
control individuals, leading to the estimate $6,349 - $21,554 = -$15,205. This is 
called a treatment-control comparison estimator as it mimics the analysis in an 
experimental setting. It can equivalently be computed as the coefficient of the treatment 
indicator D in OLS regression of RE78 on an intercept and D, using a combined 
treatment-control sample. 


Table 25.3. Training Impact: Sample Means in Treated and Control Samples (a) 

Variable     Definition                            Treated   Control 
AGE          Age in years                            25.82     34.85 
EDUC         Education in years                      10.35     12.12 
NODEGREE     1 if EDUC < 12                           0.71      0.31 
BLACK        1 if race is black                       0.84      0.25 
HISP         1 if Hispanic                            0.06      0.03 
MARR         1 if married                             0.19      0.87 
U74          1 if unemployed in 1974                  0.60      0.10 
U75          1 if unemployed in 1975                  0.71      0.09 
RE74         Real earnings in 1974 (in 1982 $)       2,096    19,429 
RE75         Real earnings in 1975 (in 1982 $)       1,532    19,063 
RE78         Real earnings in 1978 (in 1982 $)       6,349    21,554 
D            1 if received training (treatment)       1.00      0.00 
Sample size                                            185     2,490 

(a) Data are the same as in table 1 of Dehejia and Wahba (1999). The treated group is the RE74 
subsample of the NSW. The control group is the PSID-1 sample of male household heads under 55 years 
and not yet retired. Treatment occurred in 1976-1977. 

The large treatment estimate is misleading as it mostly reflects the difference in the 
types of individuals in the two samples — the control sample individuals are not good 
controls. This difference can be controlled for by including pretreatment characteristics 
as regressors, and estimating by OLS 


\[
RE78_i = x_i'\beta + \alpha D_i + u_i, \quad i = 1, \ldots, 2675. \tag{25.76}
\]


This leads to a much smaller estimated treatment effect \(\hat{\alpha}\) = $218 when, following 
Dehejia and Wahba, the regressors x are specified to be an intercept, AGE, AGESQ, 
EDUC, NODEGREE, BLACK, HISP, RE74, and RE75. This approach is called the 
control function estimator in Section 25.3.3. 
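
The contrast between the treatment-control comparison and the control function estimator can be reproduced on artificial data. The sketch below does not use the NSW-PSID data: the data-generating process, the true effect of 0.5, and the names `ols` and `simulate_training_example` are assumptions chosen only to mimic the situation in which individuals with low pretreatment earnings are more likely to be treated.

```python
import random


def ols(rows, y):
    """OLS coefficients from the normal equations (X'X) b = X'y via Gauss-Jordan."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    for col in range(k):
        pivot_row = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot_row] = a[pivot_row], a[col]
        pivot = a[col][col]
        a[col] = [value / pivot for value in a[col]]
        for r in range(k):
            if r != col:
                factor = a[r][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][k] for i in range(k)]


def simulate_training_example(n=5_000, alpha=0.5, seed=2):
    """Return (mean difference, OLS on D only, OLS on D controlling for x)."""
    rng = random.Random(seed)
    x, d, y = [], [], []
    for _ in range(n):
        x_i = rng.gauss(10.0, 3.0)                          # pretreatment earnings (hypothetical units)
        d_i = 1 if x_i + rng.gauss(0.0, 2.0) < 8.0 else 0   # low earners more likely treated
        y_i = 2.0 + 1.0 * x_i + alpha * d_i + rng.gauss(0.0, 1.0)
        x.append(x_i)
        d.append(d_i)
        y.append(y_i)
    treated = [yi for yi, di in zip(y, d) if di == 1]
    control = [yi for yi, di in zip(y, d) if di == 0]
    mean_diff = sum(treated) / len(treated) - sum(control) / len(control)
    ols_d_only = ols([[1.0, di] for di in d], y)[1]
    ols_with_x = ols([[1.0, xi, di] for xi, di in zip(x, d)], y)[2]
    return mean_diff, ols_d_only, ols_with_x


if __name__ == "__main__":
    mean_diff, ols_d_only, ols_with_x = simulate_training_example()
    print(f"Difference in means:        {mean_diff:.3f}")
    print(f"OLS coefficient on D only:  {ols_d_only:.3f}")
    print(f"OLS coefficient on D and x: {ols_with_x:.3f} (true effect assumed: 0.5)")
```

The first two numbers coincide, which is the equivalence between the difference in means and the OLS coefficient on D. Only the third controls for the pretreatment difference between the two groups, and it does so here because the simulated selection depends on x alone.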


25.8.3. Differences in Differences 


A second approach is a before—after comparison, which looks at the difference be- 
tween posttreatment earnings RE78 and pretreatment earnings RE75. Using mean 
earnings for the treated group leads to the difference estimate $6,349 — $1,532 = 
$4,817. 

This estimate may be misleading as it reflects all changes over this time period, 
such as an improved economy, and not just training. The difference-in-differences 
estimator, considered in Section 25.5, additionally calculates a similar quantity 
for the control group, $21,554 - $19,063 = $2,491, and uses this as a measure of 
nontreatment-related changes over time in earnings, so that the change over time solely 
due to treatment is $4,817 - $2,491 = $2,326. 


Table 25.4. Training Impact: Various Estimates of Treatment Effect 

Method                         Definition                                            Estimate   St. Error (a) 
Treatment-control comparison   \(\overline{RE78}_{D=1} - \overline{RE78}_{D=0}\)      -15,205       656 
Control function estimator     \(\hat{\alpha}\) from OLS regression (25.76)               218       768 
Before-after comparison        \(\overline{RE78}_{D=1} - \overline{RE75}_{D=1}\)        4,817       625 
Differences-in-differences     \(\hat{\alpha}\) from OLS regression (25.77)             2,326       749 
Propensity score               See Section 25.8.4                                         995         - 

(a) Standard errors for the first four estimates are computed using heteroskedastic-consistent standard errors from 
the appropriate OLS regression. 

The DID estimator can be shown to be equivalent to the estimate of α in the OLS 
regression 

\[
RE_{it} = \phi + \delta D78_{it} + \gamma D_i + \alpha D78_{it} \times D_i + u_{it}, \quad i = 1, \ldots, 2675, \quad t = 75, 78. \tag{25.77}
\]

Here \(RE_{i,75}\) denotes earnings in the pretreatment period and \(RE_{i,78}\) denotes earnings 
in the posttreatment period, so the regression is one with 5,350 earnings observations. 
The indicator variable \(D78_{it}\) equals one in the posttreatment period, the indicator 
variable \(D_i\) equals one if the individual is in the treated sample, and the interaction term 
\(D78_{it} \times D_i\) equals one for treated individuals in the posttreatment period. 

More generally, the intercept \(\phi\) in (25.77) can be replaced by \(x_{it}'\beta\). This makes no 
difference in this example where regressors are time-invariant so that \(x_{it} = x_i\). The 
method can be applied to repeated cross-section data (see Section 22.6.2) as it does 
not require that individuals in the treated and control groups be observed in both 1975 
and 1978. 
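
The arithmetic behind the DID estimate, and its link to the regression with a posttreatment indicator, a treated-sample indicator, and their interaction, can be written out directly from the sample means of RE75 and RE78 in Table 25.3. In this sketch the function names are hypothetical. It relies on the fact that such a regression has four coefficients and four group-by-period cells, so OLS reproduces the four cell means exactly.

```python
def did_from_means(treated_post, treated_pre, control_post, control_pre):
    """Difference-in-differences estimate from the four cell means."""
    return (treated_post - treated_pre) - (control_post - control_pre)


def saturated_coefficients(treated_post, treated_pre, control_post, control_pre):
    """Intercept, period effect, group effect, and interaction implied by the cell means."""
    phi = control_pre                      # control group, pretreatment period
    delta = control_post - control_pre     # common change over time
    gamma = treated_pre - control_pre      # pretreatment gap between groups
    alpha = did_from_means(treated_post, treated_pre, control_post, control_pre)
    return phi, delta, gamma, alpha


# Sample means of RE78 and RE75 taken from Table 25.3.
TREATED_RE78, TREATED_RE75 = 6349, 1532
CONTROL_RE78, CONTROL_RE75 = 21554, 19063

if __name__ == "__main__":
    phi, delta, gamma, alpha = saturated_coefficients(
        TREATED_RE78, TREATED_RE75, CONTROL_RE78, CONTROL_RE75
    )
    print("before-after change, treated:", TREATED_RE78 - TREATED_RE75)
    print("before-after change, control:", delta)
    print("difference in differences:   ", alpha)
    # The four coefficients add back up to the treated posttreatment mean.
    print("fitted treated post mean:    ", phi + delta + gamma + alpha)
```

The interaction coefficient equals the treated group's change less the control group's change, which is the figure reported for differences-in-differences in Table 25.4.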


25.8.4. Simple Propensity Score Estimate 


A third approach compares the outcome RE78 for a treated individual with a counter- 
factual prediction of RE78 if the same treated individual had not in fact received the 
treatment. The initial treatment-control estimate of -$15,205 is an oversimplified example 
that uses as counterfactual the average of RE78 in the control group ($21,554). 
Better counterfactuals can be generated by specifying a regression model. For example, 
the regression (25.76) specifies E[RE78|x] to equal x'β + α if treated, with counterfactual 
x'β if not treated. This places restrictions on both the effect of regressors 
x and on the effect of treatment, which, conditional on x, is assumed to be constant 
across individuals. 

The treatment effects literature emphasizes counterfactuals that do not rely on 
such strong assumptions. An obvious approach is to compare treated and untreated 
individuals with the same value of x, but in practice such matching on regressors 
is not possible if several regressors are felt to be relevant and these regressors take a 
number of different values. 

[Figure 25.3 appears here: two panels, "Comparison sample" and "Treated sample", each plotting real earnings in 1978 (vertical axis) against the propensity score (horizontal axis), showing the original data and a fitted nonparametric regression.] 

Figure 25.3: Training impact: post-treatment earnings plotted against propensity score by 
treatment status. Only observations with common support for the propensity score are 
included. Observations with earnings over $20,000 are excluded from the scatter plot, for 
readability, though they are included in the nonparametric regression. 


Instead, it can be sufficient, given assumptions detailed in Sections 25.3 and 25.4, 
to match on the propensity score, defined as the conditional probability of treatment 
Pr[D = 1|x]. For this example we estimate using only data for the initial year 1975 
the logit model 


\[
\Pr[D_i = 1 \mid x_i] = \Lambda(x_i'\beta), \quad i = 1, \ldots, 2675, \tag{25.78}
\]

where, from Section 14.2, \(\Lambda(z) = e^z/(1 + e^z)\), and following Dehejia and Wahba 
(1999) the regressors chosen are AGE, AGESQ, EDUC, EDUCSQ, NODEGREE, 
BLACK, HISP, MARR, RE74, RE75, RE74SQ, RE75SQ, and U74*BLACK. 
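
To make the estimation step concrete, the following sketch fits a logit propensity score by Newton-Raphson. It is deliberately simplified and is not the specification used in this section: there is a single hypothetical regressor plus an intercept, the data are simulated with assumed coefficients (-1.0, 1.5), and the function names are illustrative.

```python
import math
import random


def logistic(z):
    """Logistic cdf e^z / (1 + e^z), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logit(x, d, iterations=25):
    """Newton-Raphson for Pr[D = 1 | x] = logistic(b0 + b1 * x), scalar x."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0               # score (gradient of the log-likelihood)
        h00 = h01 = h11 = 0.0       # negative Hessian, X'WX with W = p(1 - p)
        for xi, di in zip(x, d):
            p = logistic(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += di - p
            g1 += (di - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: b <- b + (X'WX)^{-1} * score, using the 2x2 inverse.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def simulate_and_fit(n=20_000, b0=-1.0, b1=1.5, seed=3):
    """Simulate treatment indicators from a logit model and re-estimate it."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    d = [1 if rng.random() < logistic(b0 + b1 * xi) else 0 for xi in x]
    return fit_logit(x, d)


if __name__ == "__main__":
    est0, est1 = simulate_and_fit()
    print(f"estimated intercept: {est0:.3f}, estimated slope: {est1:.3f}")
    print(f"fitted propensity score at x = 0: {logistic(est0):.3f}")
```

The fitted propensity score for an observation is then the logistic function evaluated at its estimated index. With the full regressor list the same iteration applies, with the scalar sums replaced by matrix products.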

Figure 25.3 plots posttreatment earnings RE78 against the propensity score, sep- 
arately for the treated and control samples. Considering just the propensity score (x 
axis) it is clear that most observations in the control sample have very low propen- 
sity score, an expected result given the Table 25.3 data that treated individuals were 
disproportionately black, unemployed, low-education individuals. 

Turning to the posttreatment outcome RE78 (y axis), we see that the treatment effect 
is estimated as the difference between a given treated individual (D = 1) and a control 
sample individual (D = 0) with the same (predicted) propensity score. Each panel 
in Figure 25.3 includes a fitted nonparametric regression of RE78 on the propensity 
score. The treatment effect is less than one thousand dollars over much of the range 
of propensity score, though it is considerably larger and positive for propensity score 
around 0.80. 

There are many ways to implement this approach of comparing individuals with 
similar propensity score and then averaging over all treated individuals. One strategy 
is to match a treated individual with the control-sample individual who has the closest 
propensity score. This approach was labeled nearest-neighbor matching in Section 25.4.4. 
A simpler strategy is to stratify data by propensity score, denoted p(x), and 
let the counterfactual be the within-stratum average of RE78 for the control group. For 
example, if a treated observation has propensity score p(x) = 0.35 then the counterfactual 
may be the average of RE78 for control group observations with 0.30 < p(x) < 
0.40. The total effect is then \(\sum_s w_s (\overline{RE78}_{s,D=1} - \overline{RE78}_{s,D=0})\), where \(\overline{RE78}_{s,D=1}\) and 
\(\overline{RE78}_{s,D=0}\) denote the stratum s averages of RE78 for, respectively, the treated and control 
observations, and the weights \(w_s\) equal the fraction of treated observations in each 
stratum. A simple stratification scheme uses, say, 10 equally spaced strata with 0.0 < 
p(x) < 0.1, 0.1 < p(x) < 0.2, and so on. This was referred to as stratification matching 
in Section 25.4.4. This procedure should be restricted to cases where the propensity 
scores for the treated and control samples overlap; see Section 25.4.3. Here the propensity 
score ranges from 0.0005 to 0.9420 for the treated sample and from 0.0000 to 
0.9371 for the control sample, leading to dropping of 1,423 control group individuals 
and 8 treated individuals. The resulting estimated total effect is $995 given in 
Table 25.4. 
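
The stratification calculation can be sketched in a few lines. The function name, the equal-width strata, and the tiny data set below are illustrative assumptions, not the data of this section. Strata that lack either treated or control observations are skipped, which is one simple way of restricting attention to overlapping propensity scores.

```python
def stratification_atet(pscore, d, y, n_strata=10):
    """Weighted average over strata of (treated mean - control mean) of y.

    Weights are the share of (usable) treated observations in each stratum.
    """
    cells = {}
    for p, di, yi in zip(pscore, d, y):
        s = min(int(p * n_strata), n_strata - 1)      # stratum index 0, ..., n_strata - 1
        cells.setdefault(s, {1: [], 0: []})[di].append(yi)
    # Keep only strata containing both treated and control observations.
    usable = [c for c in cells.values() if c[1] and c[0]]
    n_treated = sum(len(c[1]) for c in usable)
    total = 0.0
    for c in usable:
        weight = len(c[1]) / n_treated
        gap = sum(c[1]) / len(c[1]) - sum(c[0]) / len(c[0])
        total += weight * gap
    return total


if __name__ == "__main__":
    # Hypothetical data: propensity score, treatment indicator, outcome.
    pscore = [0.35, 0.32, 0.38, 0.85, 0.82, 0.88, 0.05]
    d = [1, 0, 0, 1, 1, 0, 0]
    y = [10, 6, 8, 20, 24, 15, 3]
    print(stratification_atet(pscore, d, y))
```

In this toy example the stratum around 0.3 has one treated observation with a gap of 10 - 7 = 3, the stratum around 0.8 has two treated observations with a gap of 22 - 15 = 7, and the control observation at 0.05 has no treated counterpart and is dropped, so the estimate is (1/3)(3) + (2/3)(7) = 17/3.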


25.8.5. Matching Using Propensity Scores 


As mentioned in Section 25.4, other matching strategies include radius and kernel 
matching, which are also relatively easy to implement. The remainder of this chapter 
details these and other approaches, with emphasis on propensity score methods. 


Fitted Propensity Score 


The fitted propensity score is obtained using two different logit specifications, from 
Dehejia and Wahba (1999) and Dehejia and Wahba (2002), respectively. The specifi- 
cations for propensity scores are detailed at the bottom of Table 25.6. In the only de- 
parture from Dehejia and Wahba (1999, 2002), a constant term is included in our logit 
models. The estimated coefficients, not presented to save space, show an expected sign 
pattern. 


Matching Algorithms and Balancing 


An important practical issue is the choice of an appropriate matching algorithm based 
on propensity scores that ensures that balancing condition (25.9) is met. Dehejia and 
Wahba (2002, p. 161) provide an algorithm that starts with a parsimonious logit model 
to estimate p(x). The algorithm works as follows. The data are sorted according to 
p(x). The sample observations are stratified such that within a stratum the p(x) for 
treated and control units are close. For example, initially a rough grid with equal ranges 
may be used. Within each stratum the equality of means between treated and control 
units should be tested for each covariate. If there is no statistically significant differ- 
ence, then the regressors are balanced between the treated and control groups and one 
can stop. If, for some stratum, there is no balance, then for the unbalanced stratum a 
finer grid is used to achieve balance. If there are many unbalanced strata, then the orig- 
inal logit model is reestimated with an improved specification that includes interaction 
and higher order terms among the regressors. 
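
The within-stratum balance check in this algorithm can be sketched as follows. This is only an illustration of the testing step, under stated assumptions: a Welch two-sample t statistic is used for the equality of means, 1.96 is used as an approximate two-sided 5% critical value, a single covariate is checked, and the function names and data are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Two-sample t statistic for equal means, allowing unequal variances."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:
        return 0.0 if mean(a) == mean(b) else math.inf
    return (mean(a) - mean(b)) / se


def unbalanced_strata(stratum, d, x, critical=1.96):
    """Return the strata in which the covariate mean differs between groups."""
    groups = {}
    for s, di, xi in zip(stratum, d, x):
        groups.setdefault(s, {1: [], 0: []})[di].append(xi)
    flagged = []
    for s, g in sorted(groups.items()):
        if len(g[1]) < 2 or len(g[0]) < 2:
            continue                      # too few observations to test
        if abs(welch_t(g[1], g[0])) > critical:
            flagged.append(s)             # candidate for a finer grid
    return flagged


if __name__ == "__main__":
    # Hypothetical data: stratum label, treatment indicator, one covariate.
    stratum = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
    d = [1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0]
    x = [1, 2, 3, 1.5, 2.5, 2.0, 10, 11, 12, 1, 2, 3]
    print(unbalanced_strata(stratum, d, x))
```

A stratum returned by `unbalanced_strata` would be split more finely. If many strata are flagged, the propensity score model itself would be respecified, as described above.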
Table 25.5. Training Impact: Distribution of 
Propensity Score (a) for Treated and Control Units Using 
Dehejia and Wahba’s (2002) Specification 

Minimum p(x)    Treated    Untreated    Total 
0.000364              9          960      969 
0.10                 10           56       66 
0.20                 14           33       47 
0.40                 24           22       46 
0.60                 33            7       40 
0.80                 95            8      103 
Total               185         1086     1271 

(a) From the second row, for example, the propensity score lies between 
0.10 and 0.20 for 10 treated and 56 untreated individuals. 


Using the software of Becker and Ichino (2002), Dehejia and Wahba’s (2002) algo- 
rithm is used to compute the propensity scores. In all of the cases noted, the propen- 
sity score computation has been restricted to the common support region by testing 
the balancing property using those observations whose propensity scores lie in the 
intersection of the supports of the propensity score of the treated and the control units. 
This restriction reduces the original sample significantly. The size of the control group 
drops from 2,490 units to 1,086 for the Dehejia and Wahba (2002) specification. 

Table 25.5 displays the number of treated and control units in different blocks after 
the balancing is carried out by the procedure just outlined. The reported results differ 
from those of Dehejia and Wahba (2002) because the latter exclude control units from 
NSW-PSID composite samples not on the basis of common support region but on 
the basis of whether the estimated propensity score of a sample unit is less than the 
minimum of the estimated propensity score for the treated units. The table shows that 
the proportion of treated units to control units is very low for the first blocks, compared 
with the remaining blocks. 

A similar exercise for the Dehejia and Wahba (1999) specification, not tabulated 
for brevity, leads to similar results. The control group has 1,146 observations. The 
boundary values for blocking p(x) are then 0.0006526, 0.05, 0.10, 0.20, 0.40, 0.60, 
and 0.80. 


ATET Estimates by Matching Methods 


A selection of results for various matching methods are summarized in Table 25.6. The 
nearest neighbor estimate of ATET for the Dehejia and Wahba (2002) specification is 
$2,385, and for the Dehejia and Wahba (1999) specification, it is approximately $560. 
The performance of stratification and kernel matching is also mixed, the estimates of 
ATET ranging from $1,309 to $2,156. 

Table 25.6. Training Impact: Estimates of ATET 

Matching               Number     Number in              Standard    % of 
Procedure              Treated    Control       ATET     Error       $1794 (e) 

Dehejia and Wahba (2002) specification (a) 
Nearest neighbor         185         53         2385     1209 (c)      133 
Radius, r = 0.001         54        517        -7815     1118 (d)     -436 
Radius, r = 0.0001        24         92        -9333     2282 (d)     -520 
Radius, r = 0.00001       15         19        -2200     2986 (d)     -123 
Stratification           185       1086         1452     1041 (c)       81 
Kernel                   185       1058         1309      975 (c)       73 

Dehejia and Wahba (1999) specification (b) 
Nearest neighbor         185         57          560     1098 (c)       31 
Radius, r = 0.001         51        583        -9358      997 (d)     -522 
Radius, r = 0.0001         2         76        -7847     2066 (d)     -437 
Radius, r = 0.00001       16         13          223     4551 (d)       12 
Stratification           185       1146         2156      814 (c)      120 
Kernel                   185       1146         1518      890 (c)       85 

(a) Logit model: Pr[treat = 1] = h(CONSTANT, AGE, AGE², EDU, EDU², MARRIED, NODEGREE, BLACK, 
HISPANIC, RE74, RE74², RE75, U74, U75, U74*HISPANIC). 

(b) Logit model: Pr[treat = 1] = h(CONSTANT, AGE, AGE², EDU, EDU², MARRIED, NODEGREE, BLACK, 
HISPANIC, RE74, RE74², RE75, RE75², RE74*RE75, U74*BLACK). 

(c) Bootstrapped standard errors with 200 replications. 

(d) Analytical standard errors. 

(e) ATET/1794 × 100. 


For comparison, Dehejia and Wahba’s (2002) ATET estimates are reproduced in 
Table 25.7. We also note that the benchmark estimate of the treatment effect is $1,794. 
It is obtained by regressing RE78 on D for Dehejia and Wahba’s (2002) version of 
the NSW sample of both participants and nonparticipants. It is clear that the reported 
ATET estimates in Table 25.6 differ significantly from those of Dehejia and Wahba 
(2002), as well as from the benchmark actual experimental estimate. For the Dehejia 
and Wahba (2002) specification, the nearest-neighbor estimator is very close to the 
benchmark estimate and is even better than the results of Dehejia and Wahba (2002) 
in terms of reduced bias. 

For stratification and kernel estimates, the bias is larger. For the radius matching 
estimator, this bias is worse, and gives negative estimates of the treatment effect as 
opposed to the positive estimates that Dehejia and Wahba (2002) found using caliper 
matching. The difference between our radius matching and the caliper matching of 
Dehejia and Wahba (2002) is that in the latter scheme, when a given treated unit does 
not have a match within the given caliper, matching is then done with the nearest 
comparison unit outside of the given caliper. In our case, if such a situation arises, we 
ignore treated units that have no match in the prespecified radius. This illustrates the 
sensitivity of the matching estimators to assumptions. 
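
The difference between the two rules can be seen in a small sketch. The function names and the four-observation example are hypothetical. Nearest-neighbor matching (with replacement) always finds a match, whereas the radius rule as implemented here drops any treated unit with no control unit within the radius r, in line with the treatment of unmatched units described above.

```python
def nearest_neighbor_atet(treated, controls):
    """ATET from matching each treated unit to the closest control (with replacement).

    treated and controls are lists of (propensity score, outcome) pairs.
    """
    gaps = []
    for p, y in treated:
        _, y_match = min(controls, key=lambda c: abs(c[0] - p))
        gaps.append(y - y_match)
    return sum(gaps) / len(gaps)


def radius_atet(treated, controls, r):
    """ATET using the mean outcome of all controls within radius r.

    Treated units with no control inside the radius are dropped.
    Returns (estimate, number of treated units used); estimate is None if none match.
    """
    gaps = []
    for p, y in treated:
        matches = [yc for pc, yc in controls if abs(pc - p) <= r]
        if matches:
            gaps.append(y - sum(matches) / len(matches))
    if not gaps:
        return None, 0
    return sum(gaps) / len(gaps), len(gaps)


if __name__ == "__main__":
    # Hypothetical (propensity score, outcome) pairs.
    treated = [(0.30, 10), (0.80, 20)]
    controls = [(0.31, 7), (0.28, 9), (0.60, 5)]
    print("nearest neighbor:", nearest_neighbor_atet(treated, controls))
    print("radius r = 0.05: ", radius_atet(treated, controls, 0.05))
```

In this example the treated unit at 0.80 has no control within 0.05, so the radius estimate rests on a single treated unit, while the nearest-neighbor estimate matches it to the distant control at 0.60. The two estimates therefore differ sharply, which mirrors on a small scale the sensitivity visible in Table 25.6.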

The robustness of ATET estimates across specifications can be evaluated in terms 
of the ratio of ATET and the benchmark estimate, given in the last column of Table 
25.6. With the exception of the stratification matching estimator, the ratio varies widely 
over the two specifications. For example, the nearest-neighbor estimator is 133% of the 
benchmark estimator in the Dehejia and Wahba (2002) specification, but only 31% in 
the Dehejia and Wahba (1999) specification. Similarly, except for the kernel estimator, 
the ATET estimates are sensitive to the propensity score used. 


Table 25.7. Training Evaluation: Dehejia and Wahba’s 
(2002) Estimates of ATET 

Matching Procedure      ATET    Standard Error 
Nearest neighbor        1890         1202 
Radius, r = 0.001       1824         1187 
Radius, r = 0.0001      1973         1191 
Radius, r = 0.00005     1928         1196 
Radius, r = 0.00001     1893         1198 

Whether matching methods work well depends on the suitability of the propen- 
sity score model for the treatment and control groups (Dehejia and Wahba, 2002). 
However, there is clearly an interaction between the methods and the propensity score 
model. 


25.9. Bibliographic Notes 


Early economic applications of matching and differences-in-differences methods to program 
evaluation include Ashenfelter (1978) and Ashenfelter and Card (1985). Treatment evaluation 
is currently a very active and fast-moving area of econometrics research. 


25.2 Angrist et al. (1996) make useful connections between the concepts and terminology in 
the medical and the econometrics literature. 

25.3 Heckman and Robb (1985) consider the estimation of program impacts in a variety of data 
settings, in the presence of selection. See also Björklund and Moffitt (1987). Heckman and 
Hotz (1989) also argue strongly that one needs to subject the results to several specification 
tests to assess their robustness and to evaluate the impact of selection bias. For example, 
they suggest the use of multiple comparison groups to evaluate the sensitivity of the results 
based on a single control group. Most of this earlier work is parametric in approach. More 
recently nonparametric methods have been used also. 

25.4 Heckman, Ichimura, and Todd (1997) and Heckman et al. (1998) study and apply matching 
estimators. The important result concerning conditioning on the propensity score is 
given in Rosenbaum and Rubin’s (1983, theorem 2). Efficient estimation of ATE using 
estimated propensity scores is analyzed in Hirano, Imbens, and Ridder (2003). Dehejia 
and Wahba (2002) apply propensity score matching methods to a variant of the Lalonde 
(1986) data set. The experimental data are matched with observations from the CPS and 
the PSID. Smith and Todd (2004) reanalyze the data used by Dehejia and Wahba using 
a number of different variants of propensity score estimators. They highlight the biases 
associated with alternative propensity score estimators and emphasize the importance of 
high-quality data in bias minimization. Becker and Ichino (2002) provide an overview of 
some propensity score matching estimators. They also provide a set of STATA programs, 
with illustration, that can be used for estimating ATET. The February 2004 issue of the 
Review of Economics and Statistics includes a symposium on the econometrics of matching. 

25.6 Hahn, Todd, and Van der Klaauw (2001) analyze identification of treatment effects in the 
RD model under weak assumptions. 

25.7 Imbens and Angrist (1994) analyze the properties of the LATE estimator. Angrist et al. 
(1996) discuss the use of IV methods and make a connection with the LATE measure of 
treatment impact. The article is followed by a lively discussion that gives a spectrum of 
views on the IV estimator as well as literature connections; see also Heckman (1997). 
Angrist (2001) discusses some simple strategies for dealing with endogenous dummies in 
nonlinear outcome models with nonnormal outcomes. The paper is followed by discussion 
and comments that analyze the pros and cons of the linearized IV approach. There is 
a lack of consensus on the most promising among the competing approaches. Heckman, 
Tobias, and Vytlacil (2003) develop estimators for treatment effects within a latent variable 
framework. Vella and Verbeek (1999) compare the IV approach with a control function 
approach that includes a selection bias correction term. 


Exercises 


25-1 (Adapted from Heckman, 1996) Consider the treatment-outcome model 
$y = x'\beta + \alpha d + \varepsilon$, where $d$ is a binary indicator variable taking the value 1 if 
treatment is assigned randomly and 0 if treatment is not assigned (also randomly). 

(a) Is randomized treatment a sufficient condition for identification of $\alpha$? 
(b) Is randomized treatment a sufficient condition for identification of $\alpha$ and $\beta$? 


25-2 In the previous problem randomization refers to treatment. Here we consider 
randomized eligibility for receiving the treatment. Now $e = 1$ means that an individual 
is randomly made eligible and $e = 0$ means randomly made ineligible. 
Show that in this case, given $\Pr[d = 1|x] \neq 0$, the treatment effect is given by 
$\{E[y|e = 1, x] - E[y|e = 0, x]\} / \Pr[d = 1|x]$. 


25-3 Consider the nonlinear treatment outcome model $E[y|x, d] = \exp(x'\beta + \alpha d)$, 
where $d$ is a binary treatment indicator. Suppose that we have available consistent 
estimates of $(\beta, \alpha)$ and an estimated covariance matrix $\hat{V}[\hat{\beta}, \hat{\alpha}]$. Assume 
that the estimator is asymptotically normal. Outline a bootstrap or a Monte Carlo 
algorithm for estimating the ATE parameter and its asymptotic variance given 
$(x_i, d_i)$, $i = 1, \ldots, N$. 


25-4 Consider the nonlinear treatment outcome model $E[\ln y|x, d] = x'\beta + \alpha d$, 
where $d$ is a binary treatment indicator. Suppose that we have available consistent 
estimates of $(\beta, \alpha)$ and an estimated covariance matrix $\hat{V}[\hat{\beta}, \hat{\alpha}]$. Suppose 
we are interested in estimating the ATE in terms of $y$ rather than $\ln y$. Suggest 
an estimation method and discuss its consistency property. 


25-5 In this chapter the empirical illustration used the PSID control group and the 
NSW treatment group. Dehejia and Wahba (2002) used two control groups. 
There is another control group available based on the CPS. In this exercise 
you will be asked to replicate some of the calculations reported here using the 
CPS control group in place of the PSID sample. 

(a) Generate a table similar to Table 25.3. Compare the NSW group with the 
CPS controls in terms of age, ethnic composition, educational attainment, 
and pretreatment earnings. 

(b) The differences between the treatment and control groups can be viewed 
using the estimated propensity score, as was done in Section 25.8. Using 
the approach of Section 25.8.4 estimate the propensity score for the 
NSW-CPS composite sample, incorporating the covariates linearly and 
with higher order terms, as in Dehejia and Wahba (2002). Ignoring those 
comparison units whose propensity scores are less than the minimum of 
the treated units, compare the two sets of propensity scores using a histogram. 
Comment on the goodness of match with comparison units in different 
propensity score intervals (“bins”). 

(c) Using the matching methods described and implemented in Sections 25.8.4 
and 25.8.5 (especially nearest-neighbor, stratification, or interval matching, 
kernel matching, and radius matching), construct a table similar to 
Table 25.6. Comment on the estimates of ATET and compare them with 
those based on the PSID comparison group. 


Measurement Error Models 


26.1. Introduction 


Problems of measurement error pervade all econometrics. In microeconometrics, 
common sources of measurement error are an incorrect response to a 
survey question, incorrect coding of a correct response, and the use of a correctly 
measured variable as a proxy for another theoretically valid but unobserved variable (e.g., 
using observed income as a proxy for “normal income”). Questions that seek sensitive 
information may elicit partial or incorrect responses. That is, a measurement error is 
triggered by unobservables (or latent variables) when such variables are replaced by 
proxy variables. 

Here are some examples. Consider the problem of testing for the presence of gender 
bias in a study of earnings. The obvious approach is to regress a measure of earnings 
on a categorical gender variable while controlling for qualifications, age, experience, 
and so forth. However, the most relevant variable may be an individual’s on-the-job 
productivity, which may not be directly observed and a proxy may be used instead. 
Therefore, the impact of measurement error on inferences about gender discrimination 
is an important issue. Studies of individual demand for goods and services 
feature concepts such as “economic cost” or “full price of a service.” However, these 
are rarely directly measured in published data and must be constructed by the econo- 
metrician prior to model estimation. Inevitably their measurement is subject to error. 

There are virtually no models discussed in this book that are protected from the 
problem of measurement errors. Binary outcome endogenous or exogenous variables 
are potentially subject to classification errors; transition and count data collected from 
retrospective surveys are affected by recall errors; data on relatively unambiguous vari- 
ables such as hourly earnings and household expenditure are distorted by deliberate 
exaggerations and/or reporting errors. Unlike aggregate data where aggregation may 
result in some cancellation of measurement errors, for individual-level data measure- 
ment errors persist. 

In the first part of this chapter we study the consequences of measurement errors 
and estimation strategies for remedying the consequences. Both linear and nonlinear 
models are considered. Although it is more realistic to acknowledge that the problem 
usually occurs in combination with others, it is convenient for exposition to suppose 
that the only problem confronting the econometrician is measurement error. 

Broadly speaking the consequence of errors of measurement is a failure to iden- 
tify the parameter of interest. The issue of fixing the problem is complex. One may 
consider simply omitting the relevant variable in the model or substituting a proxy for 
the true measure. There are at least two important reasons for not doing so except in 
extreme cases. First, if the variable is of central interest, then omission leads to 
serious omitted variable bias, so one is substituting one type of problem for another, 
and identification is still not possible. Second, in a linear regression, using a proxy for 
the latent variable will have smaller asymptotic bias than simply omitting the variable 
from the model, provided the measurement errors are random and independent of the 
true regressor (McCallum, 1972). Ignoring the variable provides inferior estimates. 
However, using the proxy still gives inconsistent estimates even though the biases 
are smaller. 

The essential insight underlying the solution of the measurement error problem 
is that to recover the parameter of the latent variable and to identify the model, it 
is necessary to have extraneous information in the form of additional assumptions 
about the measurement error or obtain additional data and to use this information after 
invoking plausible assumptions. This is a popular approach. However, when additional 
data are unavailable, an econometric model makes a good alternative. 

Measurement errors have potentially very serious consequences since in many cases 
they lead to regression parameters becoming unidentified. For example, Card (2001) 
reviews empirical evidence on the coefficient of schooling on earnings and finds that 
the typical downward bias is of the order of 25-35%. The precise consequences of 
measurement errors may depend on the functional form of the model, how the errors 
enter the model (e.g., additively or multiplicatively), and the data structure under con- 
sideration. The solution of the problem resulting from measurement errors typically 
requires introduction of additional information into the model, either in the form of 
additional data or additional assumptions. 

It is convenient to organize the discussion of measurement error models into sep- 
arate sections on linear and nonlinear models, and then to consider special cases. 
Sections 26.2 and 26.3 are devoted to linear regression. Section 26.4 covers nonlin- 
ear regression. Section 26.5 discusses some Monte Carlo examples. Essential insights 
provided by linear models provide a useful basis for understanding the results for non- 
linear models. In all cases clearer results are usually available for specific models. 


26.2. Measurement Error in Linear Regression 


Measurement error in the regressors, also called errors-in-variables, is an important 
topic as it leads to inconsistency of the OLS estimator even if the measurement error 
has zero mean. Measurement error in the regressors is often said to lead to bias, but we 
use the stronger term inconsistency as the bias does not disappear as the sample size 
goes to infinity. 

Measurement error models have a broad scope and they cover situations in which 
the measurement error affects the right-hand-side variables (“regressors”), or the 
left-hand-side variable (“outcome”), or both. Hausman (2001) refers to them as “problems 
from the right” and “problems from the left.” In the former case, usually referred to as 
the classic errors-in-variables model, the relationship of interest is between the outcome 
$y$ and covariates $(W, X^*)$, where $W$ is measured without error and $X^*$ is not observed 
but a proxy for it, denoted $X$, is available. The question of interest is whether an 
estimated relation between $y$ and $(W, X)$ provides a satisfactory basis for inference 
regarding $X^*$. 

In the statistical literature it is conventional to distinguish between the functional 
and structural approaches to measurement error models. If X* denotes the true un- 
observed covariates, then the functional approach regards these as unknown fixed con- 
stants (parameters). In the structural approach they are treated as random variables. 
Carroll, Ruppert, and Stefanski (1995) further distinguish between functional model- 
ing in which only minimal assumptions are made about the Xs, regardless of whether 
they are fixed or random, and structural modeling in which parametric assumptions are 
made regarding the distribution of the Xs. Functional measurement error models are 
examples of models with infinitely many nuisance parameters for which the maximum 
likelihood method has well-known deficiencies (discussed in the panel data chapters). 
This distinction is less common in the econometrics literature. 

The magnitude of the inconsistency can be substantial in applications. There is a 
particularly extensive discussion of measurement error, and ways to control for it, in 
econometric studies of the determinants of individual earnings. 


26.2.1. Classical Measurement Error Model 


The standard measurement error model has a continuous dependent variable y that is a 
linear function of K true regressors x*. An additive measurement error in y may cause 
no problems if it is uncorrelated with the regressors because it can be absorbed into 
the error on the equation. If x* were observed then parameters could be consistently 
estimated by OLS regression of y on x*, 


$$ y_i = x_i^{*\prime}\beta + u_i, $$


where $u_i$ are iid $[0, \sigma^2]$. Instead, the observed data are $x_i \neq x_i^*$, and $y$ is regressed 
on $x$ rather than on $x^*$. The relationship between the true and observed regressors is 
postulated to be 

$$ x_i = x_i^* + v_i, \quad i = 1, \ldots, N, \tag{26.1} $$

where the additive measurement errors are assumed to be distributed as 

$$ v_i \sim [0, \Sigma_{vv}]. \tag{26.2} $$

The unobserved true regressors are assumed to have mean zero, so variables are 
measured as deviations from mean, and to have variance matrix 

$$ V[x_i^*] = \Sigma_{x^*x^*}. \tag{26.3} $$

Note that $x$ is an unbiased estimate of $x^*$, since the measurement error $v$ is assumed to 
have mean zero. The measurement error is assumed to be independent of both $x^*$ and 
the regression error $u$, 

$$ E[v_i|x_i^*] = E[v_i|u_i] = 0. \tag{26.4} $$


26.2.2. Inconsistency of OLS 


To consider the consequences of measurement error it is helpful to write the assumed 
dgp for the classical measurement error model in matrix notation as 


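$$ y = X^*\beta + u, \qquad X = X^* + V, \tag{26.5} $$


where $u$, the equation error, obeys the conditions $E[u|X^*] = 0$ and $E[uu'|X^*] = \sigma^2 I_N$. 
Substituting the second equation into the first yields 

$$ y = X\beta + (u - V\beta). \tag{26.6} $$

An OLS regression of $y$ on $X$ will lead to an inconsistent estimate of $\beta$, since the error 
term $(u - V\beta)$ is correlated with the regressor $X$ through the measurement error $V$. 
Formally, we have 

$$ \operatorname{plim} N^{-1}X'(u - V\beta) = \operatorname{plim} N^{-1}(X^* + V)'(u - V\beta) = -\Sigma_{vv}\beta \neq 0, $$

using $N^{-1}V'V = N^{-1}\sum_i v_i v_i'$ and $v_i$ iid $[0, \Sigma_{vv}]$. This is the essential source of 
inconsistency. Now 

$$ \operatorname{plim} N^{-1}X'X = \operatorname{plim} N^{-1}(X^* + V)'(X^* + V) = \Sigma_{x^*x^*} + \Sigma_{vv}, $$

where we have used the iid property of $x_i^*$ with mean zero and $V[x_i^*] = \Sigma_{x^*x^*}$. Also, 

$$ \operatorname{plim} N^{-1}X'y = \operatorname{plim} N^{-1}(X^* + V)'(X^*\beta + u) = \Sigma_{x^*x^*}\beta \neq 0, $$

so that, applying Slutsky’s theorem (Appendix A, Theorem A.3), we get 

$$
\begin{aligned}
\operatorname{plim} \hat{\beta} &= (\operatorname{plim} N^{-1}X'X)^{-1} \operatorname{plim} N^{-1}X'y \\
&= (\Sigma_{x^*x^*} + \Sigma_{vv})^{-1}\Sigma_{x^*x^*}\beta \\
&= \beta - (\Sigma_{x^*x^*} + \Sigma_{vv})^{-1}\Sigma_{vv}\beta.
\end{aligned}
\tag{26.7}
$$

Clearly, OLS is inconsistent as long as there are measurement errors and $\Sigma_{vv} \neq 0$. 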
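
The probability limit in (26.7) can be evaluated directly for hypothetical moment matrices. The following is a minimal sketch, not part of the original derivation: the function name `ols_plim` and the numerical values of the moment matrices and coefficient vector are assumptions chosen for illustration.

```python
import numpy as np

def ols_plim(sigma_xx, sigma_vv, beta):
    """Probability limit of OLS under classical measurement error, as in (26.7)."""
    sigma_xx = np.asarray(sigma_xx, dtype=float)
    sigma_vv = np.asarray(sigma_vv, dtype=float)
    beta = np.asarray(beta, dtype=float)
    # Solve (Sigma_xx + Sigma_vv) b = Sigma_xx beta rather than forming an inverse.
    return np.linalg.solve(sigma_xx + sigma_vv, sigma_xx @ beta)

# Hypothetical values: two correlated true regressors, only the first mismeasured.
sigma_xx = np.array([[4.0, 1.0], [1.0, 2.0]])   # variance matrix of x*
sigma_vv = np.array([[1.0, 0.0], [0.0, 0.0]])   # variance matrix of v
beta = np.array([1.0, 1.0])

plim_b = ols_plim(sigma_xx, sigma_vv, beta)
bias = plim_b - beta   # equals -(Sigma_xx + Sigma_vv)^{-1} Sigma_vv beta
print(plim_b, bias)
```

With these illustrative values the limit is (7/9, 10/9): the coefficient of the mismeasured regressor is pulled toward zero, while the coefficient of the correctly measured regressor is also displaced because the two true regressors are correlated.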


For later reference note that if we have available a consistent estimate of $\Sigma_{vv}$, 
denoted $S_{vv}$, and if $(X'X - S_{vv})$ is positive definite (with $S_{vv}$ expressed on the same 
scale as $X'X$), then the adjusted least-squares 
estimator $\hat{\beta}_A = (X'X - S_{vv})^{-1}X'y$ can be computed. This formula can also be used to 
study the impact of hypothetical values of measurement error variances on the 
least-squares estimator. 
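
A small simulation can illustrate the adjusted least-squares idea. This is a sketch under stated assumptions rather than a recommended procedure: the measurement error variance matrix is treated as known, the data-generating values are hypothetical, and the correction term is taken to be N times that variance matrix so that it is on the same scale as X'X.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed so the run is repeatable
N = 200_000
beta = np.array([1.0, -0.5])             # hypothetical true coefficients

# True regressors (mean zero) with standard deviations 2 and 1.
x_star = rng.normal(size=(N, 2)) * np.array([2.0, 1.0])
sigma_vv = np.diag([1.0, 0.25])          # treated as known here (an assumption)
v = rng.normal(size=(N, 2)) * np.sqrt(np.diag(sigma_vv))
u = rng.normal(size=N)

X = x_star + v                           # observed, mismeasured regressors
y = x_star @ beta + u

# OLS on the mismeasured regressors: inconsistent, as in (26.7).
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Adjusted least squares: subtract the error moment matrix from X'X.
# S_vv is put on the same scale as X'X by multiplying by N.
S_vv = N * sigma_vv
b_adj = np.linalg.solve(X.T @ X - S_vv, X.T @ y)

print("OLS:     ", b_ols)
print("Adjusted:", b_adj)
```

In this design both reliability ratios equal 0.8, so OLS should settle near 0.8 times the true coefficients while the adjusted estimator should be close to the true values.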


26.2.3. Measurement Error with a Scalar Regressor 


A special case of this model that routinely features in textbooks involves the case of 
a single true or unobserved regressor $x^*$ with variance $\sigma_{x^*}^2$, observed value $x$, 
zero-mean measurement error $v$, and associated variance $\sigma_v^2$. That is, the regression is 
$y = \beta x^* + u$, where $E[u|x^*] = 0$, $V[u|x^*] = \sigma_u^2$, and $\mathrm{Cov}[v, u] = 0$, but in estimating the 
regression $x^*$ is replaced by the observed variable $x$. 

In this case, (26.7) specializes to 

$$
\begin{aligned}
\operatorname{plim} \hat{\beta} &= \beta\,\frac{\sigma_{x^*}^2}{\sigma_{x^*}^2 + \sigma_v^2} \\
&= \beta\,\frac{1}{1 + \sigma_v^2/\sigma_{x^*}^2} \\
&= \beta\,[1 - s/(1 + s)],
\end{aligned}
\tag{26.8}
$$

where $s = \sigma_v^2/\sigma_{x^*}^2$ is often referred to as the noise-to-signal ratio and the entire 
term $(1 + s)^{-1}$ is referred to as the reliability ratio. Asymptotically $\hat{\beta}$ is 
biased toward zero to an extent that depends directly on the noise-to-signal ratio. This 
bias is also called attenuation bias. The terminology is intuitive since it suggests that 
a researcher’s estimate of the marginal impact of change in $x^*$ on $y$ is attenuated by 
the presence of measurement error in $x^*$. 
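
As a numerical sketch of (26.8), the snippet below computes the reliability ratio and then undoes the attenuation for a given OLS slope. The helper name `reliability_ratio`, the standard deviations (2 for the true regressor, 1 for the measurement error), and the OLS slope of 0.04 are illustrative assumptions.

```python
def reliability_ratio(var_true, var_error):
    """Reliability ratio 1/(1+s), with s = var_error / var_true as in (26.8)."""
    s = var_error / var_true             # noise-to-signal ratio
    return 1.0 / (1.0 + s)

sd_true, sd_error = 2.0, 1.0             # illustrative standard deviations
lam = reliability_ratio(sd_true**2, sd_error**2)

b_ols = 0.04                             # slope estimated with the mismeasured regressor
b_corrected = b_ols / lam                # implied slope on the true regressor

print(lam, b_corrected)
```

Dividing by the reliability ratio is only as good as the assumed variances; in practice these would have to come from outside information such as validation or replicate data.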

Note also that 


$$ V[y|x] = \sigma_u^2 + \frac{\beta^2\sigma_v^2\sigma_{x^*}^2}{\sigma_v^2 + \sigma_{x^*}^2} > \sigma_u^2. $$


This implies that measurement errors not only cause attenuation bias but they also 
inflate the equation error variance. Unambiguously, a reduction in the variance of the 
measurement error will reduce the residual variance of the equation. 

Had an intercept term been included in the bivariate regression just presented, this 
would bias upward the least-squares estimator of the intercept, $\bar{y} - \hat{\beta}\bar{x}$ (when $\beta$ 
and the mean of $x$ are both positive), where $(\bar{y}, \bar{x})$ 
are sample averages that are still consistent estimates of the respective population 
means. Cragg (1994) suggests the term “contamination bias” for this effect of 
measurement error on another regression parameter in the equation. 

As an example, consider regression of log hourly wage on years of schooling. Suppose 
years of schooling $x^*$ are measured with error, and assume that the standard 
deviation of true years of schooling is 2 and the standard deviation of the measurement 
error is 1, so that $\sigma_{x^*}^2 = 4$, $\sigma_v^2 = 1$, and $\sigma_x^2 = 5$. Then $\operatorname{plim} \hat{\beta} = 0.8 \times \beta$. For 
example, an OLS estimated slope coefficient of 0.04 means that one more year of school is 
actually associated with a 5% higher wage rather than a 4% higher wage. 


26.2.4. Extensions 


In extensions and generalizations of this simple but elegant result, researchers often 
ask if attenuation bias is a general feature of measurement error models, and what if 
anything is attenuated. Although the result does not necessarily carry over to more 
general models, it does provide a useful benchmark. Hausman (2001) has called the 
attenuation bias caused by measurement error the “Iron Law of Econometrics.” 

If the measurement error is assumed to be uncorrelated with the true unobserved 
value, the measurement error is said to be “classical.” Although convenient, this as- 
sumption may not hold. Indeed in some cases it cannot hold. For example, if x is a 
binary 0/1 variable, the measurement error will be a classification error. If, owing to 
misclassification, a 0 is measured as a 1, and vice versa, then the measurement error 
must be correlated with the true value. 
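
The claim that classification error in a binary variable must be correlated with the true value can be checked by enumerating the joint distribution. The sketch below is illustrative only: the function name and the misclassification rates `q01` and `q10` are hypothetical.

```python
def misclassification_cov(p, q01, q10):
    """Cov[v, x*] for binary x* with Pr[x* = 1] = p and error v = x - x*.

    q01 = Pr[x = 1 | x* = 0] and q10 = Pr[x = 0 | x* = 1] are hypothetical
    misclassification rates.
    """
    # Joint distribution of (x*, v, probability): v can be +1 only when
    # x* = 0 and -1 only when x* = 1.
    outcomes = [
        (0, 0, (1 - p) * (1 - q01)),
        (0, 1, (1 - p) * q01),
        (1, 0, p * (1 - q10)),
        (1, -1, p * q10),
    ]
    e_v = sum(v * pr for _, v, pr in outcomes)
    e_x = sum(xs * pr for xs, _, pr in outcomes)
    e_vx = sum(xs * v * pr for xs, v, pr in outcomes)
    return e_vx - e_v * e_x

cov = misclassification_cov(p=0.5, q01=0.1, q10=0.1)
print(cov)
```

Working through the algebra gives a covariance of -(q01 + q10) p (1 - p), which is negative whenever any misclassification occurs, so the classical assumption of uncorrelated error cannot hold in this setting.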

When there is more than one regressor, let $X^* = [x^* \; Z]$, where as in the preceding 
case we assume that only one regressor is observed with measurement error, that is, 
$x = x^* + v$. Then the expression for the least-squares estimator of the coefficient of $x$ 
becomes 

$$ \operatorname{plim} \hat{\beta}_x = \beta\left[1 - \frac{\sigma_v^2}{\sigma_{x^*}^2(1 - R_{x^*,z}^2) + \sigma_v^2}\right], \tag{26.9} $$

where $R_{x^*,z}^2$ denotes the $R^2$ in the auxiliary regression of $x^*$ on $Z$. The formula (26.9) 
is essentially the same as (26.8), provided we reinterpret the variance of $x^*$ to mean the 
variance after controlling for or removing the linear influence of $Z$ on $x^*$. Once again 
the inconsistency of the least-squares estimator is toward zero, and the probability limit 
is a smaller multiple of $\beta$ than in the single regressor case whenever $R_{x^*,z}^2 > 0$. The 
coefficients of the regressors measured without error are also inconsistent, in a direction 
that depends on $\Sigma_{x^*x^*}$ (Levi, 
1973). This effect can once again be thought of as contamination bias. The attenuation 
bias that is demonstrated in these special cases depends critically on the assumption of 
additive measurement errors. 
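
To see how the auxiliary R-squared in (26.9) sharpens the attenuation, the multiplier of the true coefficient can be tabulated for a few hypothetical values. The function name and all numerical inputs below are assumptions for illustration.

```python
def attenuation_multiplier(var_true, var_error, r2_xz=0.0):
    """Multiplier of beta in (26.9): 1 - var_error / (var_true*(1 - R^2) + var_error)."""
    residual_var = var_true * (1.0 - r2_xz)   # variance of x* not explained by Z
    return 1.0 - var_error / (residual_var + var_error)

# Hypothetical variances (4 and 1) and auxiliary R^2 values.
multipliers = {r2: attenuation_multiplier(4.0, 1.0, r2) for r2 in (0.0, 0.5, 0.9)}
for r2, m in multipliers.items():
    print(r2, round(m, 4))
```

With an R-squared of zero the multiplier reduces to the scalar reliability ratio of (26.8); as the other regressors explain more of the true regressor, less independent signal remains and the multiplier shrinks.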

When more than one regressor is measured with error general results on the direction 
of the inconsistency are no longer available, though in any given problem they 
can be determined given knowledge of $\Sigma_{x^*x^*}$ and $\Sigma_{vv}$. Most studies consider 
measurement error in only one regressor, in which case the inconsistency is toward zero. The 
intuition from the foregoing examples is that if the measurement errors on different 
regressors are independent, then each source will contribute to the attenuation bias of 
its “own” coefficient, and all will contribute to the inflation bias of the conditional 
variance. Cragg (1994) analyzes a multiple regression model with measurement errors 
and shows the interactions among biases from different sources. 


26.2.5. Measurement Error in Linear Panel Models 


The effects of measurement error in regressors are compounded when panel data are 
used. 

Assume a pooled panel model $y_{it} = \beta x_{it}^* + u_{it}$, where we observe $x_{it} = x_{it}^* + v_{it}$, 
and a scalar regressor is assumed for simplicity. The preceding results still hold if we 
estimate a single cross section. However, if we estimate using more than one year of 
data for each individual we need to adapt the previous results, since the regressor $x_{it}^*$ 
will most likely be positively correlated, rather than independent, over $t$ for given $i$. For 
example, if we do the first-differences regression 

$$
\begin{aligned}
\Delta y_{it} &= \beta\Delta x_{it}^* + \Delta u_{it} \\
&= \beta\Delta x_{it} + \Delta u_{it} - \beta\Delta v_{it}
\end{aligned}
$$

(see Section 21.6) and define $\rho = \mathrm{Cor}[x_{it}^*, x_{i,t-1}^*]$, then 

$$
\begin{aligned}
\operatorname{plim} \hat{\beta} &= \beta + \left(\operatorname{plim} N^{-1}\sum_i (\Delta x_{it})^2\right)^{-1}\left(\operatorname{plim} N^{-1}\sum_i \Delta x_{it}(\Delta u_{it} - \beta\Delta v_{it})\right) \\
&= \beta - \frac{2\beta\sigma_v^2}{2(1 - \rho)\sigma_{x^*}^2 + 2\sigma_v^2} \\
&= \beta - \frac{\beta\sigma_v^2}{(1 - \rho)\sigma_{x^*}^2 + \sigma_v^2},
\end{aligned}
$$

using $V[\Delta v_{it}] = 2V[v_{it}]$ and $V[\Delta x_{it}^*] = 2(1 - \rho)V[x_{it}^*]$. 

The inconsistency is larger than in the cross-section case if $\rho > 0$. Moreover, as 
$\rho \to 1$, as can happen with panel data, the inconsistency becomes very large. This 
inconsistency can be decreased by using differences that are $m > 1$ lags apart because 
$\mathrm{Cor}[x_{it}^*, x_{i,t-m}^*]$ will be decreasing in $m$. 
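
The first-differences result can be checked with a two-period simulation. This is a rough sketch under assumed values: the true regressor has variance 4 in each period and correlation 0.8 across periods, the measurement error has variance 1 and is independent over time, and the coefficient is 1.

```python
import numpy as np

rng = np.random.default_rng(1)           # fixed seed for repeatability
N, beta, rho = 200_000, 1.0, 0.8         # hypothetical design values
var_true, var_error = 4.0, 1.0

# Two periods of the true regressor with common variance and correlation rho.
cov = var_true * np.array([[1.0, rho], [rho, 1.0]])
x_star = rng.multivariate_normal(np.zeros(2), cov, size=N)
x = x_star + rng.normal(scale=np.sqrt(var_error), size=(N, 2))
y = beta * x_star + rng.normal(size=(N, 2))

# Cross-section OLS using the first period only (no intercept, mean-zero data).
b_cs = (x[:, 0] @ y[:, 0]) / (x[:, 0] @ x[:, 0])

# OLS on first differences.
dx = x[:, 1] - x[:, 0]
dy = y[:, 1] - y[:, 0]
b_fd = (dx @ dy) / (dx @ dx)

# Probability limit implied by the first-differences formula in the text.
implied = beta * (1.0 - var_error / ((1.0 - rho) * var_true + var_error))
print(b_cs, b_fd, implied)
```

Differencing removes most of the persistent signal but none of the serially independent noise, so the first-differences estimate should be attenuated far more than the cross-section estimate in this design.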


26.3. Identification Strategies 


It is conventional to say that without additional assumptions the errors-in-variables 
model is not identified. This statement can be interpreted as follows in the context of 
the special case of the bivariate model. An estimated value of $\hat{\beta}$, or more precisely its 
probability limit, is consistent with infinitely many different combinations of $\beta$ and 
$s$, the noise-to-signal ratio. If, however, additional assumptions or information can be 
brought to bear on the problem, it may be possible to rule out some combinations of 
the underlying parameters that are consistent with the observed data distribution. If 
the additional restrictions are just sufficient to obtain a unique solution, the model is 
said to be exactly identified. If the additional restrictions are more than sufficient to 
uniquely identify the model parameters, the model is said to be overidentified. 
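
The lack of identification in the bivariate model can be made concrete: in the scalar case the probability limit of OLS equals the true slope divided by one plus the noise-to-signal ratio, so quite different parameter pairs imply the same limit. The pairs below are hypothetical and chosen only so that each produces a limit of 0.8.

```python
import math

def ols_plim_scalar(beta, s):
    """Probability limit of OLS in the scalar case: beta / (1 + s)."""
    return beta / (1.0 + s)

# Hypothetical (beta, s) pairs: no measurement error through equal noise and signal.
pairs = [(0.8, 0.0), (1.0, 0.25), (1.2, 0.5), (1.6, 1.0)]
plims = [ols_plim_scalar(b, s) for b, s in pairs]
print(plims)
```

Data on the observed variables alone cannot distinguish among these pairs, which is why bounds or outside information are needed.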

A general identification strategy for the measurement error model is to obtain 
bounds rather than point estimates of the parameters of interest if there is no further a 
priori information or data. If additional data and/or information about measurement error 
are available then additional identification strategies, such as instrumental variables 
estimation or identification through moment restrictions, become feasible. Additional 
information about the measurement error is a broad concept that includes one of the 
oldest identification strategies, one using instrumental variables that link the true 
unobserved variables to their observable counterparts. For example, additional information 
may yield a consistent estimator for the attenuation factor, $\sigma_{x^*}^2/(\sigma_{x^*}^2 + \sigma_v^2)$, 
making it possible to adjust the inconsistent estimate for the bias. Finally, replicated 
data or validation data may be available, and these can yield useful information 
about the moments of measurement error. These possibilities are analyzed in the 
following. 


26.3.1. Setting Bounds on Regression Parameters 


Reconsider the multiple regression problem of Section 26.2. The model given there 
is subject to the requirement that the variances $\Sigma_{x^*x^*}$, $\Sigma_{vv}$, and $\sigma^2$ must be positive 
semidefinite. This together with the orthogonality conditions of estimation can be used 
to place some bounds on the region in which the coefficients must lie. Klepper and 
Leamer (1984) and Wansbeek and Meijer (2000) consider the problem in some gener- 
ality. A more accessible special case of the bounds approach is the reverse regression 
approach considered next. 


Reverse Regression 


In a simple bivariate regression model with variables (y, x), direct regression refers 
to the regression of y on x, whereas reverse regression refers to the regression of 
x on y. In the general multivariate regression case with K covariates, there is only 
one direct regression but there are K reverse regressions. Each reverse regression 
has a mismeasured exogenous variable on the left-hand side and the remaining ex- 
ogenous variables and y on the right-hand side. In the bivariate regression case with 
measurement errors, it is easy to show that the estimated slope coefficients from the 
direct and reverse regressions place lower and upper bounds on the value of the true 
slope coefficient. This is a potentially useful result in analyzing the effects of measure- 
ment errors. Leamer (1978) provides an excellent discussion of the logic of reverse 
regression. 

First, we consider the logic of reverse regression by reference to a simple bivariate 
regression model with measurement errors: 


$$ y = \beta x^* + u, \qquad x = x^* + v, \tag{26.10} $$


where $u$ is the regression error and $v$ is the measurement error that accounts for the 
difference between the observed variable $x$ and the error-free measure $x^*$ that enters the 
regression. We will assume that $u \sim N[0, \sigma_u^2]$ and $v \sim N[0, \sigma_v^2]$. 

Following the functional approach of Solari (1969) (and Leamer, 1978), treat $x^*$ 
as unknown parameters in the likelihood function. The joint likelihood given data 
$(y, x)$ is 

$$
\begin{aligned}
L(\beta, \sigma_u^2, \sigma_v^2, x^*) \propto{} & (\sigma_u^2)^{-N/2}\exp\left[-\frac{1}{2\sigma_u^2}(y - \beta x^*)'(y - \beta x^*)\right] \\
& \times (\sigma_v^2)^{-N/2}\exp\left[-\frac{1}{2\sigma_v^2}(x^* - x)'(x^* - x)\right].
\end{aligned}
\tag{26.11}
$$

This function is not defined at points that satisfy the conditions $\sigma_v^2 = 0$ and $x^* = x$, 
or the conditions $\sigma_u^2 = 0$ and $y = \beta x^*$. If we simply maximize the well-defined parts 
of this likelihood subject to the constraints we get two scalar regression parameters, 
$\hat{\beta}_D = x'y/x'x$ for the direct regression and $\hat{\beta}_R = y'y/x'y$, the reciprocal of the 
least-squares slope $x'y/y'y$, for the reverse regression. 
To aid intuition, notice that if $x$ is measured without error then $y$ is stochastic and $x$ 
is not, so direct regression has a meaningful conditional expectation interpretation, 
and if only $x$ is stochastic (measured with error), then the conditional expectation 
$E[x|y]$ is meaningful, because the two-equation system then reduces to $x = (1/\beta)y - u/\beta + v$. 
That is, the reverse regression produces the least-squares estimate of $(1/\beta)$. It is 
straightforward to verify that 

$$ r_{xy}^2\hat{\beta}_R = \hat{\beta}_D, \qquad \hat{\beta}_D < \beta < \hat{\beta}_R, \tag{26.12} $$

where $r_{xy}^2$ is the simple squared correlation between $x$ and $y$; the bounds indicate that 
$\hat{\beta}_D$ is a downward biased estimate of $\beta$ and $\hat{\beta}_R$ is an upward biased estimate. Note 
that these bounds can be very broad in microeconomic data where $r_{xy}^2 < 0.5$ is almost 
always the case and even $r_{xy}^2 < 0.1$ is quite common. 
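
The bounds from direct and reverse regression can be illustrated with simulated data. The sketch below assumes a positive true slope of 1, a true regressor with standard deviation 2, and unit standard deviations for both errors; these values and the variable names are hypothetical. Uncentered moments are used because the model has no intercept and the variables have mean zero.

```python
import numpy as np

rng = np.random.default_rng(2)           # fixed seed for repeatability
N, beta = 100_000, 1.0                   # hypothetical design values

x_star = rng.normal(scale=2.0, size=N)              # true regressor
x = x_star + rng.normal(scale=1.0, size=N)          # observed with error
y = beta * x_star + rng.normal(scale=1.0, size=N)   # outcome

b_direct = (x @ y) / (x @ x)             # slope from regressing y on x
b_reverse = (y @ y) / (x @ y)            # reciprocal of slope from regressing x on y
r2 = (x @ y) ** 2 / ((x @ x) * (y @ y))  # squared (uncentered) correlation

print(b_direct, b_reverse, r2)
```

The identity linking the two slopes through the squared correlation holds exactly in any sample; the ordering of the two estimates around the true slope is a large-sample statement and relies here on the slope being positive. In this design the interval runs from roughly 0.8 to roughly 1.25, which shows how wide the bounds can be even with a fairly reliable regressor.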

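Leamer (1978) considers the model in which $(y, x)$ has a bivariate normal distribution 
with mean $(\beta\bar{x}^*, \bar{x}^*)$, where $\bar{x}^*$ denotes the mean of $x^*$, and covariance matrix 

$$
\Sigma = \begin{bmatrix} \sigma_u^2 + \beta^2\sigma_{x^*}^2 & \beta\sigma_{x^*}^2 \\ \beta\sigma_{x^*}^2 & \sigma_{x^*}^2 + \sigma_v^2 \end{bmatrix}.
\tag{26.13}
$$

He shows (Leamer, 1978, pp. 239-240) that the likelihood function for this model 
attains its maximum at any value of $\beta$ between the direct regression estimator $\hat{\beta}_D$ and 
the reverse regression estimator $\hat{\beta}_R$. 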

The foregoing analysis suggests that even though $\beta$ is not identified, consistent 
bounds can be placed on its value. This is a potentially useful application of bounds 
identification. The result can be extended in a straightforward manner to the case of 
multiple regression in which only one regressor is measured with error (Bollinger, 
2003). Klepper and Leamer (1984) consider an extension to the multiple regression 
case of $K$ regressors, all of which are measured with error. There is one direct 
regression and $K$ reverse regressions. After estimation each reverse fitted regression is 
renormalized with a unit coefficient for $y$ on the left-hand side. Then $\hat{\beta}_D$ is the 
estimated vector from the direct regression, and $\hat{\beta}_{R,j}$ $(j = 1, \ldots, K)$ is the vector from 
the $j$th reverse regression. By the results of Klepper and Leamer (1984), if the direct 
and reverse regression coefficient vectors are all in the same orthant then the set of 
feasible values of $\beta$ is the convex hull of the direct and reverse regressions; that is, 
$\beta \in \{\beta \mid \beta = \lambda_0\hat{\beta}_D + \lambda_1\hat{\beta}_{R,1} + \cdots + \lambda_K\hat{\beta}_{R,K}\}$, where the $\lambda$-weights are nonnegative 
and sum to one. The smallest coefficient in the direct and reverse regression vectors is 
the lower bound, and the largest coefficient is the upper bound. These bounds do not 
exist if the coefficient changes its sign. 

In addition to the work of Klepper and Leamer (1984), there are several studies 
that use these ideas in applied contexts. Greene (1983) and Goldberger (1984) apply 
reverse regression to measurement of salary discrimination. Bollinger (2003) analyzes 
measurement of the black-white wage gap in a model of wages and human capital. 
Bollinger (1996) applies the bounds approach to the case of regression on a categorical 
dummy variable in which observation categories are misclassified. 


26.3.2. Identification Using Instrumental Variables 


One solution to the identification problem is to introduce one or more moment restric- 
tions that constitute further identifying information. A moment restriction typically 
states that there is available an instrumental variable that is correlated with, or causally 
related to, the variable that is measured with error. Moreover, this variable is uncorre- 
lated with, or causally unconnected with, the outcome that is being modeled. Adding 
this restriction to the original model helps in principle to solve the identification 
problem. 

Historically, the IV estimator was suggested as a potential solution for the measurement 
error problem in linear models (Reiersøl, 1941; Durbin, 1954). The IV approach 
is similarly motivated when one or more variables on the right-hand side are endoge- 
nous and hence correlated with the regression error. The linear simultaneous equation 
model and the linear measurement error model are isomorphic and hence the use of 
IV-type estimators in the context of measurement errors is natural. 

Reconsidering the linear IV model of Sections 4.8 and 6.4, where $y = X\beta + u$ 
and $E[u|X] \neq 0$, we can use the 2SLS estimator if a valid set of instruments $Z$, 
with $\dim[Z] \geq \dim[X]$, is available. 

One can test for the presence of measurement error using a Hausman test of 
endogeneity of regressors; see Section 8.3. Several variants of the test are available, and 
one variant was given in Section 8.4. 

A major problem in implementing the IV estimator lies in the practical difficulty 
of finding valid instruments. Good instruments have two properties: zero correlation 
with equation errors (for consistency) and high correlation with variables being 
instrumented (for efficiency). Such instruments are not typically easy to find. Although 
ideally one should explicitly derive valid instruments from detailed specification of 
relationships between regressors and covariates, in practice ad hoc methods are common. 
Compared with the full system specification approach, the ad hoc method is simpler and 
less demanding. Notice that the conditions for the validity of instruments do not create 
an automatic procedure for selecting one. These technical conditions could be satisfied 
by a variable that is causally unconnected with the phenomenon under study. One has 
to think of a variable that correlates strongly with the regressor(s) and is uncorrelated 
with the equation error. A number of interesting applications of this idea are available 
in the literature; see, for example, Angrist (1990). If selected, the use of such an 
instrumental variable may be controversial and puzzling. 

We consider several possible instruments for the cross-section regression of earn- 
ings on schooling example. First, if data are available on siblings then the schooling 
level of a sibling may be used as an instrument, since the education levels of siblings 
are likely to be correlated. Consistency of the IV estimate then requires no correla- 
tion between the measurement error v and any measurement error in schooling of the 
sibling. Second, more generally other variables related to schooling such as parents’ 
educational level or income may be used. Casting a broader net, however, runs the risk
of yielding instruments that are only weakly correlated with x, leading to imprecision
and possibly poor finite-sample properties of the IV estimator. Third, more than one
question on schooling level may have been asked in the survey, or schooling level may 
be available from surveys in other years if data are from a panel study. Such instru- 
ments are likely to be highly correlated with x, but the assumption of no correlation be- 
tween measurement errors in x and z may be more difficult to believe in this example. 

Lagged variables are frequently used as instruments, but these too will have mea- 
surement errors, so the approach is minimally satisfactory only if serial correlation in 
measurement error is not a problem. 

The effect of measurement error can be large in the panel context. Since panel data
provide measures of $x_{it}$ in multiple periods, instrumental variables estimation can be
used to provide consistent parameter estimates assuming uncorrelated measurement
errors across the time periods. See Hsiao (1986, pp. 63-65).
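
The role of a second report as an instrument can be illustrated with a small simulation. The following is a minimal sketch, not part of the original analysis: the sample size, the value of the slope, and the error standard deviations are all hypothetical, and the second measurement `z` is assumed to have an error that is independent of the error in `x`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 50_000, 1.0                      # hypothetical sample size and slope

x_star = rng.normal(0.0, 1.0, n)           # true regressor (unobserved in practice)
u = rng.normal(0.0, 1.0, n)                # equation error
y = beta * x_star + u

x = x_star + rng.normal(0.0, 0.7, n)       # mismeasured regressor
z = x_star + rng.normal(0.0, 0.7, n)       # second report with an independent error

# All variables have mean zero, so no intercept is included.
b_ols = (x @ y) / (x @ x)                  # attenuated toward zero
b_iv = (z @ y) / (z @ x)                   # just-identified IV: (z'x)^{-1} z'y

print(f"OLS: {b_ols:.3f}   IV: {b_iv:.3f}   true: {beta:.3f}")
```

If the two measurement errors were instead positively correlated, as may happen with repeated survey questions, the IV estimate in this sketch would also be biased toward zero.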


26.3.3. Identification via Additional Moment Restrictions 


Distributional assumptions about the equation and measurement errors $(u, v)$ can
secure identification. There is one important case in which the identification is aided
instead by information or assumption about the distribution of the unobserved true
value of the mismeasured variable. The assumption of joint multivariate normality of
$(y, x, x^*)$, together with the assumption that the measurement error $v$ and equation
error $u$ are, respectively, iid $\mathcal{N}[0, \sigma_v^2]$ and iid $\mathcal{N}[0, \sigma_u^2]$, is not sufficient to identify
the measurement error model. However, the assumption that the first four moments of
$(x^*, u, v)$ exist and that the third moments of each and the third cross-moments are not
zero, indicating a departure from normality, is sufficient to secure identification, as we
now demonstrate.
Let us reconsider the model (26.10)

$$
\begin{aligned}
y &= \beta x^* + u, \\
x &= x^* + v,
\end{aligned}
$$

whose reduced form $y = \beta x + \varepsilon$, where $\varepsilon = u - \beta v$, is to be estimated by an
instrumental variables procedure. However, we now add a new piece of information: that the
distribution of $x^*$ is nonnormal in the sense that it is both skewed and has nonnormal
(excess) kurtosis (Cragg, 1997; Dagenais and Dagenais, 1997; Wansbeek and Meijer,
2000). These assumptions imply the following six conditions:


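$$
\begin{aligned}
E[(xy)x] &= \beta E[x^{*3}], & E[(xy)\varepsilon] &= 0, \\
E[(x^2)x] &= E[x^{*3}] + E[v^3], & E[(x^2)\varepsilon] &= -\beta E[v^3], \\
E[(y^2)x] &= \beta^2 E[x^{*3}], & E[(y^2)\varepsilon] &= E[u^3].
\end{aligned}
$$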


The first row implies that the product variable $x_i y_i$ is a valid instrument if
$E[x^{*3}] \neq 0$. The second row implies that $x_i^2$ is a valid instrument if $E[x^{*3}] \neq 0$ but
$E[v_i^3] = 0$; that is, if $x^*$ is nonnormal but $v$ has a symmetric distribution. Indeed, the
greater the skewness the better is the instrument. However, because $x^*$ is unobservable, any
inferences about it will need to be based on $x$ itself. The last row implies that $y_i^2$ is a
valid instrument if the third moment of $x^*$ is nonzero but the third moment of $u$ is zero.

Given these moment conditions, the IV approach can be applied to consistently 
estimate the model parameters. This example illustrates how additional moment as- 
sumptions can help generate useful instruments even when no data other than (y;, x;) 
are available. 
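
A small simulation can make the first of these instruments concrete. The sketch below is illustrative only: it assumes a centered exponential distribution for the true regressor (so that its third moment is nonzero), normal and hence symmetric errors, and hypothetical parameter values. The observables are centered before the product instrument is formed, in line with the zero-mean setup of the moment conditions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta = 200_000, 1.0                       # hypothetical values

x_star = rng.exponential(1.0, n) - 1.0       # skewed, mean zero, E[x*^3] = 2
u = rng.normal(0.0, 1.0, n)                  # symmetric equation error
v = rng.normal(0.0, np.sqrt(0.5), n)         # symmetric measurement error
y = beta * x_star + u
x = x_star + v

# Center the observables.
x = x - x.mean()
y = y - y.mean()

z = x * y                                    # product instrument built from (y, x) alone
b_ols = (x @ y) / (x @ x)                    # probability limit beta / (1 + 0.5)
b_iv = (z @ y) / (z @ x)                     # IV estimate using z = x * y

print(f"OLS: {b_ols:.3f}   IV with x*y: {b_iv:.3f}   true: {beta:.3f}")
```

When the true regressor is drawn from a symmetric distribution instead, the denominator `z @ x` in this sketch has expectation zero and the instrument becomes useless, which is the sense in which greater skewness gives a better instrument.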


26.3.4. Replicated Data 


An alternative solution is possible if the measurement error variances can be estimated. 
The basic idea here is that we can adjust the sample second-moment matrix X’X of the 
regressors by an amount that depends on the variance and covariances of measure- 
ment errors. Notice that we do not try to adjust the observations themselves. Instead, 
the sample moments are adjusted because the estimator is a function of those sample 
moments. This key idea generalizes to more complex models also. 

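When the measurement error variance $\Sigma_{vv}$ is known, a consistent estimate of $\beta$ can
be obtained using

$$
\hat{\beta} = (X'X - N\Sigma_{vv})^{-1} X'y, \tag{26.14}
$$

where $N$ is the sample size. This is consistent since

$$
\begin{aligned}
\operatorname{plim} \hat{\beta} &= \operatorname{plim}(N^{-1}X'X - \Sigma_{vv})^{-1} \operatorname{plim} N^{-1}X'y \\
&= (\Sigma_{x^*x^*} + \Sigma_{vv} - \Sigma_{vv})^{-1} \Sigma_{x^*x^*}\beta \\
&= \beta,
\end{aligned}
$$

where $\operatorname{plim} N^{-1}X'y = \Sigma_{x^*x^*}\beta$ is obtained using $X = X^* + V$ and $y = X\beta + (u - V\beta)$.
For a detailed account of ways to estimate $\Sigma_{vv}$ in a substantive application,
see Krashinsky (2004).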
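
The moment adjustment in (26.14) is straightforward to compute. The following sketch applies it to simulated data with two zero-mean regressors, only the first of which is mismeasured. The parameter values and the error covariance matrix `sigma_vv` are hypothetical, and `sigma_vv` is treated as known.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
beta = np.array([1.0, -0.5])                 # hypothetical true coefficients

X_star = rng.normal(size=(n, 2))             # true regressors, mean zero
sigma_vv = np.diag([0.5, 0.0])               # known error covariance: only column 0 is noisy
V = rng.normal(size=(n, 2)) * np.sqrt(np.diag(sigma_vv))
X = X_star + V                               # observed regressors
y = X_star @ beta + rng.normal(size=n)

# OLS uses the unadjusted second-moment matrix X'X.
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Equation (26.14): subtract N * Sigma_vv from X'X, leave the data themselves alone.
b_corr = np.linalg.solve(X.T @ X - n * sigma_vv, X.T @ y)

print("OLS      :", np.round(b_ols, 3))
print("Corrected:", np.round(b_corr, 3))
```

In small samples the adjusted matrix need not be positive definite, so a practical implementation would check its eigenvalues before inverting it.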

Data replication is a situation in which an unbiased estimate of the unobserved X* 
is available. Suppose that the measurement error is additive and we have an observable 
X: 

X = X*+V. 


If $X$ is an unbiased estimate of $X^*$, then $E[V|X^*] = 0$. If data are replicated, this
simply means that we have at least two measurements available on $X$. It also means that
with multiple measurements we can obtain estimates of the moments of $V$, assuming
the measurement errors for multiple measures are uncorrelated.

Suppose there are two scalar measurements (replicates) $X_{(1)}$ and $X_{(2)}$, such that
$X_{(j)} = X^* + V_{(j)}$, $j = 1, 2$. Then $V[V_{(1)}] = E[X_{(1)}^2] - E[X_{(1)}X_{(2)}]$, which can be
estimated by the sample average $N^{-1}\sum_i [X_{(1),i}^2 - X_{(1),i}X_{(2),i}]$. Then the regression
parameters can be estimated using Equation (26.14).
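
The two steps, estimating the error variance from replicates and then adjusting the second moment, can be sketched as follows for a scalar regressor. The parameter values are hypothetical, and the two replicate errors are assumed to be independent of each other and of the true regressor.

```python
import numpy as np

rng = np.random.default_rng(8)
n, beta, sigma_v2 = 100_000, 1.5, 0.4        # hypothetical values

x_star = rng.normal(0.0, 1.0, n)             # true regressor, mean zero
y = beta * x_star + rng.normal(0.0, 1.0, n)
x1 = x_star + rng.normal(0.0, np.sqrt(sigma_v2), n)   # first replicate
x2 = x_star + rng.normal(0.0, np.sqrt(sigma_v2), n)   # second replicate

# Step 1: E[x1^2] - E[x1 * x2] equals the measurement error variance.
sigma_v2_hat = np.mean(x1**2 - x1 * x2)

# Step 2: scalar version of (26.14) using the first replicate as the regressor.
b_ols = (x1 @ y) / (x1 @ x1)
b_corr = (x1 @ y) / (x1 @ x1 - n * sigma_v2_hat)

print(f"estimated error variance: {sigma_v2_hat:.3f}")
print(f"OLS: {b_ols:.3f}   corrected: {b_corr:.3f}   true: {beta:.3f}")
```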

For example, suppose we wish to predict grade point average (GPA) in the first 
year of college using performance on the SAT exam taken in high school. It is known 
that observed SAT scores for a given person vary across different takes of the exam. 
Let $x^*$ denote the true SAT score, and let $x_1$ and $x_2$ denote the observed SAT scores
on two separate SAT exams. Then $x_1 = x^* + v_1$, $x_2 = x^* + v_2$, and it is assumed that
$v_1$ and $v_2$ are independent with equal variance $\sigma_v^2$. It follows that $\mathrm{Cov}[x_1, x_2] = \sigma_{x^*}^2$,
$V[x_1] = V[x_2] = \sigma_{x^*}^2 + \sigma_v^2$, and $\mathrm{Cor}[x_1, x_2] = \sigma_{x^*}^2/(\sigma_{x^*}^2 + \sigma_v^2)$. Studies find the tests
to have a reliability of 0.9, which means that the correlation from one test to the next
is 0.9. Thus $\sigma_{x^*}^2/(\sigma_{x^*}^2 + \sigma_v^2) = 0.9$. It follows from
(26.8) that $\operatorname{plim} \hat{\beta} = 0.9 \times \beta$, so that because of measurement error SAT scores are
a stronger predictor of first-year college GPA than OLS regression suggests.


26.3.5. Validation Data 


Sometimes a validation sample is also collected as an additional check on the origi- 
nal responses. Although the validation sample pertains to the population of interest, 
it may come from a different independent source. For example, patients may respond 
to a questionnaire about medical services received, and providers of services may re- 
spond to a validation survey. Another example is that of employees who may provide 
some information about an event, and the information may be validated by the same 
information obtained from the employers. A leading example in economics is the PSID 
validation study of Bound et al. (1994). 

Let $X$ be an $N \times K$ matrix of observations on regressors measured with error, and
let $X_v$ be an $M \times K$ matrix of validation data. We can use validation data by regressing
the columns of $X_v$ on $X$, and generating "predicted" values $X[X'X]^{-1}X'X_v$ that
replace the error-contaminated matrix $X$. For nonlinear models more complex procedures
are used; see Lee and Sepanski (1995).

The use of generated regressors that are substituted into the regression of interest
can be a practically useful strategy if the predictions come from a well-fitting regression.
Generated regressors are estimates of the true values and hence subject to estimation 
uncertainty. As such this uncertainty should be taken into account in estimating the 
sampling variance of the regression coefficients. The relevant theory was covered in 
Section 6.8. 
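
One way to operationalize the validation-data idea is sketched below. This is an illustration under assumptions that are not in the text: the validation sample is taken to be the first `m` observations of the main sample, the true regressor is observed only there, and all parameter values are hypothetical. The true values are regressed on the error-ridden measure in the validation subsample, and the fitted relationship is then used to generate a regressor for the full sample.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, beta = 50_000, 5_000, 2.0              # hypothetical sizes and slope

x_star = rng.normal(1.0, 1.0, n)             # true regressor
x = x_star + rng.normal(0.0, 1.0, n)         # error-ridden measure, available for everyone
y = beta * x_star + rng.normal(0.0, 1.0, n)

# Validation subsample (assumed: first m observations), where x_star is observed.
X_val = np.column_stack([np.ones(m), x[:m]])
gamma = np.linalg.lstsq(X_val, x_star[:m], rcond=None)[0]

# Generated regressor: prediction of the true value given the observed measure.
x_hat = gamma[0] + gamma[1] * x


def ols_slope(w, outcome):
    """Slope from an OLS regression of outcome on an intercept and w."""
    W = np.column_stack([np.ones(len(w)), w])
    return np.linalg.lstsq(W, outcome, rcond=None)[0][1]


b_naive = ols_slope(x, y)                    # attenuated
b_gen = ols_slope(x_hat, y)                  # uses the generated regressor

print(f"naive: {b_naive:.3f}   generated regressor: {b_gen:.3f}   true: {beta:.3f}")
```

The point estimate is all this sketch delivers. Conventional OLS standard errors from the second regression ignore the sampling error in `gamma` and would need the generated-regressor adjustment discussed above.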


26.4. Measurement Errors in Nonlinear Models 


Nonlinear models, as should by now be abundantly clear, comprise a bewildering array
of models. Obtaining general results, such as attenuation bias, that apply to a broad
class of models poses a major challenge. Not unusually, general results are obtained
under simplifying assumptions, whereas more specific results can pay more attention 
to complexity and specificity of particular data situations. Therefore, it is not surprising 
that the development of this topic in the literature has produced many procedures and 
approaches that are specific to particular models. For example, in dealing with binary 
outcome models with left-hand-side measurement error it is natural to focus on the 
problem of misclassification; in dealing with count models also with left-hand-side 
measurement error it is equally natural to focus on the issues of under- and overre- 
porting. Motivated by this difficulty, Hsiao (1992) recommends shifting attention from 
providing solutions for a general model to a specific type of question. In covering 
model-specific results, there is a danger of being compendious and of losing sight of 
general results. We therefore begin with some selected general results. 


26.4.1. Identification through Instrumental Variables 


A general technique in the linear errors-in-variables model is the instrumental vari- 
ables method. For the nonlinear (in regressors) regression model, Y. Amemiya (1985) 
showed that the IV estimator is generally inconsistent, being consistent only under the 
assumption of a shrinking error variance—covariance matrix. 

A simple exposition of the aforementioned point is based on the regression equation 


$$
y = \beta_0 + \beta_1 f(x^*) + e, \tag{26.15}
$$

where $f(x^*)$ is a smooth, differentiable, and bounded function of an error-free scalar
regressor $x^*$. The observed variable $x = x^* + v$, where $v$ is a measurement error.
Substituting for $x^*$ and taking a Taylor expansion of $f(x - v)$ around $x$ yields

$$
y = \beta_0 + \beta_1 f(x) + e - \beta_1 f^{(1)}(x)\,v + \beta_1 \sum_{j=2}^{\infty} \frac{(-v)^j}{j!} f^{(j)}(x), \tag{26.16}
$$

where $f^{(j)}(\cdot)$ denotes the $j$th derivative of $f(\cdot)$. Consider the quadratic case
$f(x) = x^2 + \gamma x$, so $f^{(1)}(x) = 2x + \gamma$, $f^{(2)}(x) = 2$, and $f^{(j)}(x) = 0$, $j > 2$. Then

$$
\begin{aligned}
y &= \beta_0 + \beta_1 (x^2 + \gamma x) + e - \beta_1 (2x + \gamma)v + \beta_1 2v^2/2 \\
  &= \beta_0 + \beta_1 x^2 + \beta_1 \gamma x + (e - 2\beta_1 xv - \beta_1 \gamma v + \beta_1 v^2),
\end{aligned} \tag{26.17}
$$

so valid instrumental variables should be correlated with $x^2$ and $x$, but uncorrelated
with $u = (e - 2\beta_1 xv - \beta_1 \gamma v + \beta_1 v^2)$. Clearly it is not enough that $v$ and $e$ are
individually uncorrelated with the instruments. This means that the instrumental variable for
$f(x)$ has to satisfy more stringent properties than in the linear case.

More generally, Y. Amemiya has shown, using Taylor approximation, that the in- 
strumental variable does not yield consistent estimates for nonlinear errors-in-variables 
models because the residual term involves both measurement error and an observed 
error-contaminated variable. Therefore it is not possible to find an instrumental vari- 
able that is highly correlated with the observed variable but uncorrelated with residual 
term. Furthermore, from a practical viewpoint, it is not easy to verify the validity of
an instrumental variable in estimation because of limited information about the latent
variable (x*) and measurement error. 


26.4.2. Identification Using Replicated Data 


Faced with the difficulty of implementing an IV-type estimation method, there are two 
alternatives. 

The first is to make very strong distributional assumptions about the conditional 
distribution of the unobserved x* given the observed x. Such assumptions, augmented 
by other technical conditions, make it possible to identify the parameters of the model. 
This approach has been followed by Y. Amemiya (1985) and Hsiao (1989), among 
others. 

A second approach is to consider the possibility of having a large number of measurements
of each unobserved $x^*$, denoted $x_{(j)}$. Then the average of the replicated
measures for each x* is substituted for the unobserved regressor. Consistent estima- 
tion of the nonlinear regression then follows because the covariance matrix of mea- 
surement errors shrinks to zero as the number of replicates grows; see Y. Amemiya 
(1985). Unfortunately, such a scenario is rarely encountered in econometrics. 

Since there does not exist common structural information in nonlinear measurement 
error models that can be used to identify and estimate regression models, we consider 
some specific nonlinear regression models. 

Hausman, Newey, and Powell (1995) analyze polynomial Engel curves using Con- 
sumer Expenditure Survey data. Their polynomial function is linear in parameters. 
They prove that, under regularity conditions, both an instrumental variable and an ad- 
ditional measurement can be used to obtain consistent and asymptotically normally 
distributed estimates. In this application, an adjacent quarter is treated as a replica- 
tion and an instrumental variable. They further propose that a general nonlinear func- 
tion can be approximated by a polynomial function. However, they admit that the IV 
method cannot be implemented in this case and an additional measure of true regres- 
sors is needed. 

Li (2002) proposes a general two-stage approach to the nonlinear errors-in-variables 
problem, which relies on repeated measurements. In the first stage, based on empirical 
characteristic functions and the inverse Fourier transform, a nonparametric estima- 
tor is obtained for the conditional density of the latent variables. With this estimator 
available, a semiparametric nonlinear least-squares estimator is constructed using a 
minimum distance criterion. He establishes the estimator’s consistency. This estima- 
tor is also robust in the sense that it does not require any knowledge of the functional 
form of the latent variables. Li’s approach can be applied to any nonlinear errors-in- 
variables situation if replicated measurements are available. However, the asymptotic 
distribution of the estimator has not been established. 


26.4.3. Measurement Errors in Dependent Variables 


In a linear regression model the measurement errors in the dependent variable inflate 
the standard errors of regression parameters but do not lead to inconsistency of the 
estimator. In a nonlinear model there are additional consequences.

One class of applications has considered misclassification of responses in qualitative
choice models. This has generated a literature on reporting errors.


Discrete Choice Models 


Poterba and Summers (1995), in a study of the effects of unemployment insurance on 
the duration of unemployment using the CPS data, generalize a probabilistic model to 
allow for misclassification in labor market status transition. Specifically, they focus on 
potential classification errors in three classes: employed, unemployed, and not in the 
labor force. They develop a multinomial logit model with a special feature of the data 
set: that all of the individuals are assumed to correctly report as unemployed in the first 
survey month. Their results show that unemployment insurance increases unemploy- 
ment spells and that correction for labor market status misclassification strengthens the 
apparent effect of unemployment insurance on spell durations. However, their model
is based on an assumption that the probability of reporting errors is fixed and uncorrelated
with individual characteristics, which, as the authors agree, is “unlikely to hold
in practice.” Although the authors claim that the parameter estimates are consistent,
Hausman, Abrevaya, and Scott-Morton (1998) argue that the standard errors are
inconsistently estimated because they ignore the sampling variability of the estimated
error probability and the non-block-diagonal form of the information matrix.

Hausman et al. (1998) propose a parametric method for estimating a binary choice 
model with misclassification. However, their parametric method requires knowledge
of the error distribution. They emphasize that parameter estimates may be inconsistent
if the errors do not follow the assumed parametric distribution. They further
introduce a two-stage semiparametric method. The key condition in the model for
identification is that the expected value of the observed dependent variable is an
increasing function of the underlying index, which they show is weaker than the
condition for identification of a parametric model. Compared to the approach of Poterba and
Summers (1995), theirs is more robust in the sense that the misclassification probability is
allowed to be a function of individual characteristics. Using the CPS and PSID, they show
that serious misclassification exists in a job-change variable.

Klein and Sherman (1997) develop an “Orbit model” (with features of ordered 
choice model and Tobit model) for the estimation of projected demand for a poten- 
tial new video product. They find evidence that potential consumers exaggerate de- 
mand. The Orbit model is a two-stage procedure with the first stage estimating the 
parameters of a standard Tobit model for actual future demand and the second stage 
estimating the mapping function between current projected demand and actual future 
demand. They further establish consistency and asymptotic normality of Orbit esti- 
mators. However, the identification of the model requires the assumption that the pro- 
jected zero demand will be exact zero demand in future as well. This may be a strong 
assumption. 

Hsiao and Sun (1999) use market survey data on the demand for an advanced elec- 
tronic device. They argue that respondents may report biased demands. They propose 
a randomized response model and a one-sided response bias model for overreporting, 
in which different parametric probabilities are assigned to the truth and alternative
choices (including the truth) with logit or probit density function for the truly
revealed preference. They find that “there is a substantial response bias in the data and
the revised market take rates and price elasticities appear more reasonable than the 
estimates obtained based on the assumption that the respondents truly indicate their 
preference.” 


Count Regression 


In the nonlinear count regression context, Cameron and Trivedi (1998) suggest an 
approach for modeling count data subject to probabilistic underrecording. The ap- 
proach generates compound Poisson and negative binomial count models by allowing 
for a binary recording outcome. Specifically, for each single occurrence of an event, 
a Bernoulli trial is used to determine whether the event is recorded. Given a positive 
probability that an event may not be recorded, the distribution of the recorded events 
has a smaller mean and variance than the distribution of the actual events. They fur- 
ther investigate estimation of the models by ML, quasi-generalized pseudo maximum 
likelihood, and moment-based methods. Based on a Monte Carlo study, they find that 
the performance of the ML estimator is good for samples of size 500 or more. 
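
The recording mechanism is easy to simulate. The sketch below covers only the simplest case, a Poisson count with a constant recording probability, and the values of the mean `mu` and the recording probability `p` are hypothetical. Each actual event passes through an independent Bernoulli trial that decides whether it is recorded.

```python
import numpy as np

rng = np.random.default_rng(4)
n, mu, p = 200_000, 3.0, 0.7                 # hypothetical mean and recording probability

actual = rng.poisson(mu, n)                  # true number of events
recorded = rng.binomial(actual, p)           # each event recorded with probability p

print(f"actual  : mean {actual.mean():.3f}, variance {actual.var():.3f}")
print(f"recorded: mean {recorded.mean():.3f}, variance {recorded.var():.3f}")
```

In this special case the recorded count is again Poisson, with mean `p * mu`, so its mean and variance both shrink by the factor `p`.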

Jordan et al. (1997) give an application of the errors-in-variables method in the 
Poisson regression model. In a study of death from stomach cancer in five Japanese 
counties, they notice that a covariate (e.g., plasma lycopene level) is unknown and 
is estimated from a randomly chosen collective and, therefore, is subject to sampling 
error. With the assumption that the measurement error is distributed normally, they 
implement a Bayesian technique by obtaining the posterior distributions of the param- 
eters using Gibbs sampling. The results indicate that the corrected model gives more 
accurate estimates of the parameters even when the original sample is small. 


26.4.4. Poisson Regression with Measurement Errors in Covariates 


We now consider in greater detail one specific example of a nonlinear regression model 
with additive measurement errors in covariates. This example illustrates both the con- 
sequences of such measurement errors and also feasible estimation strategies. 

Guo and Li (2002) have shown that measurement errors in covariates in general 
lead to overdispersion in the observed data. They also show using Monte Carlo
simulations that biases will occur if the overdispersion caused by measurement er- 
rors is incorrectly modeled as arising from unobserved heterogeneity. Therefore, one 
should not conclude from the presence of overdispersion that a model with unobserved 
heterogeneity is warranted. 

Stefanski (1989) and Nakamura (1990) propose a corrected score estimator that 
is consistent if measurement errors are present. In particular, Nakamura (1990) gives 
a closed form of corrected score function when the measurement errors are normally 
distributed and replicated data are also available. By contrast, Guo and Li (2002) have 
generalized Nakamura (1990).


Measurement Errors and Overdispersion


In this section, we consider the Poisson regression model in which the discrete random
variable $y$ follows the Poisson distribution with parameter $\mu = \exp(x^{*\prime}\beta)$, where $\beta$
is a $K \times 1$ parameter. As is well known, the Poisson regression model has an
equidispersion property that

$$
E[y|x^*] = V[y|x^*]. \tag{26.18}
$$

If the measurement errors are additive, then

$$
x = x^* + \varepsilon,
$$

where $\varepsilon$ are assumed to be independent of unobserved latent variable $x^*$, with mean
zero and variance-covariance matrix $\Sigma_\varepsilon$. This notation covers the case where all or
some of the explanatory variables are measured with errors.

Measurement errors increase dispersion (see Chesher, 1991). This applies to the
Poisson regression, in the sense that although (26.18) holds for the conditional mean
and variance of $y$ given $x^*$, conditioning on $x$ changes the result. Instead, we get
$E[y|x] < V[y|x]$, in part because $E[y|x^*] \neq E[y|x]$ and $V[y|x^*] \neq V[y|x]$.

If $g(x^*|x)$ denotes the conditional density of $x^*$ given $x$, then Guo and Li show that

$$
\begin{aligned}
E[y|x] &= \int E[y|x^*]\, g(x^*|x)\, dx^* \\
&= \int E[y^2|x^*]\, g(x^*|x)\, dx^* - \int \left(E[y|x^*]\right)^2 g(x^*|x)\, dx^*,
\end{aligned} \tag{26.19}
$$

where the second line uses (26.18), and the conditional variance of $y$ given $x$ is given by

$$
V[y|x] = \int E[y^2|x^*]\, g(x^*|x)\, dx^* - \left[\int E[y|x^*]\, g(x^*|x)\, dx^*\right]^2. \tag{26.20}
$$

A comparison of (26.19) and (26.20) shows that the first term on the right-hand side of
(26.19) is the same as the first term in (26.20). Using this Guo and Li show that

$$
\left[\int E[y|x^*]\, g(x^*|x)\, dx^*\right]^2 < \int \left(E[y|x^*]\right)^2 g(x^*|x)\, dx^*, \tag{26.21}
$$

which is interpreted to mean that measurement errors lead to overdispersion.
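
The overdispersion result can be checked numerically. The following sketch assumes a scalar normal regressor, normal measurement error, and hypothetical parameter values. It compares the mean and variance of the count among observations whose conditioning variable lies in a narrow band around zero, first conditioning on the true regressor and then on the mismeasured one.

```python
import numpy as np

rng = np.random.default_rng(5)
n, beta, sigma_e2 = 400_000, 1.0, 0.5        # hypothetical values

x_star = rng.normal(0.0, 1.0, n)             # true regressor
y = rng.poisson(np.exp(beta * x_star))       # equidispersed given x_star
x = x_star + rng.normal(0.0, np.sqrt(sigma_e2), n)   # observed regressor


def mean_var_near(cond, center, half_width=0.05):
    """Mean and variance of y for observations with cond close to center."""
    sel = np.abs(cond - center) < half_width
    return y[sel].mean(), y[sel].var()


m_true, v_true = mean_var_near(x_star, 0.0)  # approximately E[y|x*] and V[y|x*]
m_obs, v_obs = mean_var_near(x, 0.0)         # approximately E[y|x] and V[y|x]

print(f"given x* near 0: mean {m_true:.3f}, variance {v_true:.3f}")
print(f"given x  near 0: mean {m_obs:.3f}, variance {v_obs:.3f}")
```

Conditional on the true regressor the mean and variance nearly coincide, whereas conditional on the observed regressor the variance exceeds the mean even though no unobserved heterogeneity was built into the data generating process.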


Estimation of Errors-in- Variables Model 


When $x$ are contaminated by measurement errors, ML estimation or NLS based on the
observables $(y, x)$ does not provide consistent estimates. Replacement of covariate $x^*$
by $x$ in estimation is referred to as a “naive” model.

There are two issues to consider. First, why does ML give inconsistent estimates 
when measurement errors are present? Second, is consistent estimation possible? The 
answer to the second question is “yes” if we adopt, following Stefanski (1989) and 
Nakamura (1990), the method of corrected score estimation for the generalized linear 
models.

The idea underlying the corrected score estimator is that the conditional distribution
of the corrected estimate with respect to x, given the true independent variables x* 
and the dependent variables y, is centered around the ML estimate, which provides a 
consistent estimate of the true value of the parameter of interest. 


Inconsistent and Consistent Estimators 


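Suppose that $N$ observations $(y_i, x_i^*)$, $i = 1, \ldots, N$, are generated from a Poisson
distribution with probability mass function

$$
\Pr[Y_i = y_i | x_i^*] = \frac{e^{-\mu_i(\beta_0)} \mu_i(\beta_0)^{y_i}}{y_i!},
$$

where $\mu_i(\beta_0) = \exp(x_i^{*\prime}\beta_0)$. Given observations $(y_i, x_i^*)$, $i = 1, \ldots, N$, the MLE $\hat{\beta}$ is
consistent since the probability limit of the average log-likelihood function

$$
\begin{aligned}
\operatorname{plim} N^{-1} \ln L(\beta) &= \operatorname{plim} N^{-1} \sum_i \left\{-e^{x_i^{*\prime}\beta} + y_i x_i^{*\prime}\beta - \ln y_i!\right\} \\
&= E_{y,x^*}\left[-e^{x^{*\prime}\beta} + y x^{*\prime}\beta - \ln y!\right]
\end{aligned} \tag{26.22}
$$

is maximized at $\beta = \beta_0$.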

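Suppose we observe $x_i$ rather than $x_i^*$, where $x_i = x_i^* + \varepsilon_i$ and $\varepsilon_i \sim \mathcal{N}[0, \Sigma_\varepsilon]$
independent of $x_i^*$. Then $y_i|x_i$ is not Poisson distributed. If one nevertheless uses the
“naive” Poisson model, the resulting estimator $\tilde{\beta}$ maximizes

$$
Q(\beta) = N^{-1} \sum_i \left\{-e^{x_i'\beta} + y_i x_i'\beta - \ln y_i!\right\}. \tag{26.23}
$$

This misspecified log-likelihood function converges to

$$
\operatorname{plim} Q(\beta) = E_{y,x^*}\left[-e^{x^{*\prime}\beta} + y x^{*\prime}\beta - \ln y!\right] + E_{x^*}\left[-e^{x^{*\prime}\beta}\right]\left(E_\varepsilon\left[e^{\varepsilon'\beta}\right] - 1\right), \tag{26.24}
$$

which in general is not maximized at $\beta = \beta_0$. So $\tilde{\beta}$ is inconsistent for $\beta_0$.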
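A suitably modified objective function yields consistent estimates. Equations
(26.22) and (26.24) imply that

$$
\operatorname{plim} Q(\beta) - E_{x^*}\left[-e^{x^{*\prime}\beta}\right]\left(E_\varepsilon\left[e^{\varepsilon'\beta}\right] - 1\right) = \operatorname{plim} N^{-1} \ln L(\beta).
$$

This suggests maximizing the objective function

$$
Q^{+}(\beta) = N^{-1} \sum_i \left\{-e^{x_i'\beta} + y_i x_i'\beta - \ln y_i!\right\} - E_{x^*}\left[-e^{x^{*\prime}\beta}\right]\left(E_\varepsilon\left[e^{\varepsilon'\beta}\right] - 1\right),
$$

since $Q^{+}(\beta)$ converges to $\operatorname{plim} N^{-1} \ln L(\beta)$. Now, given independence of $x^*$ and $\varepsilon$,

$$
E_{x^*}\left[-e^{x^{*\prime}\beta}\right] E_\varepsilon\left[e^{\varepsilon'\beta}\right] = E_x\left[-e^{x'\beta}\right],
$$

which is consistently estimated by $-N^{-1} \sum_i e^{x_i'\beta}$. It follows after some cancellation
that maximizing $Q^{+}(\beta)$ is equivalent to maximizing

$$
Q^{+}(\beta) = N^{-1} \sum_i \left\{y_i x_i'\beta - \ln y_i!\right\} - E_{x^*}\left[e^{x^{*\prime}\beta}\right]. \tag{26.25}
$$

This yields a consistent estimate of $\beta_0$. Implementation requires a suitable estimate
of $E_{x^*}[e^{x^{*\prime}\beta}]$, which is possible if replicated data are available. If the distribution of
explanatory variables is specified up to unknown parameters, then these unknown
parameters can be estimated by the replicated measurements. Therefore, $E_{x^*}[e^{x^{*\prime}\beta}]$ can
be estimated.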

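The estimator $\hat{\beta}_c$ that maximizes (26.25) is termed the corrected score estimator
by Guo and Li (2002) because it is the root of the corrected score function
$\sum_i \left(y_i x_i - E_{x^*}[x^* e^{x^{*\prime}\beta}]\right) = 0$. Guo and Li also establish the asymptotic normality of
this estimator. The estimated asymptotic covariance matrix $\hat{V}[\hat{\beta}_c] = N^{-1} A^{-1} B A^{-1}$,
where

$$
\begin{aligned}
A &= E_{x^*}\left[e^{x^{*\prime}\beta} x^* x^{*\prime}\right], \\
B &= N^{-1} \sum_i \left(y_i x_i - E_{x^*}\left[e^{x^{*\prime}\beta} x^*\right]\right)\left(y_i x_i - E_{x^*}\left[e^{x^{*\prime}\beta} x^*\right]\right)'.
\end{aligned}
$$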


Nakamura (1990) made the stronger assumption that the measurement errors $\varepsilon$ are
normally distributed as $\mathcal{N}[0, \Omega]$. Then

$$
\exp(x^{*\prime}\beta) = E_{x|x^*}\left[\exp\left(x'\beta - \beta'\Omega\beta/2\right)\right].
$$

By the law of iterated expectations

$$
E_{x^*}\left[\exp(x^{*\prime}\beta)\right] = E_x\left[\exp\left(x'\beta - \beta'\Omega\beta/2\right)\right],
$$

which can be consistently estimated by $N^{-1} \sum_i \exp\left(x_i'\beta - \beta'\Omega\beta/2\right)$. Consequently,
substituting this estimate into (26.25) gives the objective function

$$
Q^{+}(\beta) = N^{-1} \sum_i \left[y_i x_i'\beta - \ln y_i! - \exp\left(x_i'\beta - \beta'\Omega\beta/2\right)\right].
$$

This is the corrected log-likelihood function given in Nakamura (1990). Maximization
with respect to $\beta$ yields a consistent estimate of $\beta_0$.

Nakamura’s approach reminds one of the estimation of the linear regression with 
measurement errors (see (26.14)) given an estimate of the covariance matrix of mea- 
surement errors. As in that case, to maximize Nakamura’s corrected log-likelihood 
function one requires knowledge of Q, the covariance matrix of measurement errors. 
This can come from replicated data. However, if the covariates are predominantly dis- 
crete, then the normality of measurement error is not a sensible assumption. In such 
cases the estimator of Guo and Li is more attractive. 
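
Nakamura's corrected log-likelihood can be maximized with a few Newton steps. The sketch below is illustrative only: the function name `fit_corrected_poisson`, the starting values, and all parameter values are hypothetical, the measurement error is normal with covariance matrix `omega` treated as known, and the first column of the regressor matrix is assumed to be an intercept. Setting `omega` to zero reproduces the naive Poisson MLE.

```python
import numpy as np


def fit_corrected_poisson(y, X, omega, n_iter=50, tol=1e-10):
    """Maximize sum_i [y_i x_i'b - exp(x_i'b - b'omega b / 2)] by Newton's method."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())               # start at the intercept-only fit
    for _ in range(n_iter):
        ob = omega @ beta
        m = np.exp(X @ beta - 0.5 * beta @ ob)   # corrected mean term for each i
        Z = X - ob                               # rows are x_i - omega beta
        grad = X.T @ y - Z.T @ m
        hess = -(Z.T * m) @ Z + m.sum() * omega
        step = np.linalg.solve(hess, grad)
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta


rng = np.random.default_rng(6)
n, sigma_e2 = 100_000, 0.5                   # hypothetical values
beta0 = np.array([0.5, 0.5])                 # hypothetical intercept and slope

x_star = rng.normal(size=n)
y = rng.poisson(np.exp(beta0[0] + beta0[1] * x_star))
x = x_star + rng.normal(0.0, np.sqrt(sigma_e2), n)

X = np.column_stack([np.ones(n), x])
omega = np.diag([0.0, sigma_e2])             # the intercept column has no error

b_naive = fit_corrected_poisson(y, X, np.zeros((2, 2)))
b_corr = fit_corrected_poisson(y, X, omega)

print("naive    :", np.round(b_naive, 3))
print("corrected:", np.round(b_corr, 3))
```

The corrected objective is not globally concave, so this simple Newton iteration may fail from poor starting values. A more careful implementation would add step-size control.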

For the case of multivariate $x^*$, the computation of $E[\exp(x^{*\prime}\beta)]$ is not
straightforward, even if the distribution of $x^*$ is known, because multiple integrals are
involved. Simulation-based methods (Li, 2002) provide one possible approach to this
problem.

Implementation of several other nonlinear errors-in-variables models also requires
replicated observations; for example, see Hsiao (1992) and Hausman, Newey, and
Powell (1995). Panel data could provide replicated observations at the level of an
individual. For example, consider the case of a scalar regressor $x^*$ for which two
replications of $x$ are available, because $x_{ij} = x_i^* + \varepsilon_{ij}$ for $i = 1, \ldots, N$ and $j = 1, 2$. Then a
moment-based consistent estimator of $\sigma_\varepsilon^2$ is $\hat{\sigma}_\varepsilon^2 = \sum_i (x_{i1}^2 + x_{i2}^2 - 2x_{i1}x_{i2})/2N$. Thus
both the mean and variance of $x^*$ can be estimated.


26.5. Attenuation Bias Simulation Examples


Analytical results for the linear model are given in Section 26.2, but results are much 
more difficult to obtain in nonlinear models. Here we present two simulation examples, 
one for the logit model and one for a linear-in-logs model, that illustrate attenuation 
bias in nonlinear regression with measurement error in the regressor. 

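In the first example, the dgp is the logit model with

$$
\begin{aligned}
y^* &= \alpha + \beta x^* + \varepsilon, \qquad x^* \sim U[0, 1], \quad \varepsilon \sim \text{logistic}, \\
y &= \begin{cases} 0 & \text{if } y^* < 0, \\ 1 & \text{if } y^* > 0. \end{cases}
\end{aligned}
$$

The complication is that $x^*$ is measured with error, so that

$$
x = x^* + v, \qquad v \sim \mathcal{N}[0, \sigma_v^2].
$$

Since $x^* \sim U[0, 1]$ it has variance $\sigma_{x^*}^2 = 1/12$, and the noise-to-signal ratio is
$s = \sigma_v^2/\sigma_{x^*}^2 = 12\sigma_v^2$. A logit regression of $y$ on $x$ rather than of $y$ on $x^*$ is estimated.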

To conduct a simulation exercise we carry out a logit regression of $y$ on $x$ for six
different values of the noise-to-signal ratio, including the value of zero, which
benchmarks the model. The sample size is fixed at 1,000, and 100 simulation replications
are used.

Table 26.1 shows the average values of $(\hat{\alpha}, \hat{\beta})$ in 100 replications, where $\hat{\alpha}$ and $\hat{\beta}$
are the estimated intercept and slope from logit regression of $y$ on $x$, rather than the
correct logit regression of $y$ on $x^*$. The first column with
$s = 0$ benchmarks the model. Recall that for OLS linear regression in the same setup
the multiplicative bias in the slope coefficient is $1/(1 + s)$, or 0.96, 0.8, 0.5, 0.2, and
0.1, respectively. Here the biases have a similar direction, except that for logit regression
they are considerably larger.

Table 26.1. Attenuation Bias in a Logit Regression with Measurement Error

Noise/Signal    0        0.04     0.25     1        4        9
Average α       0.785    1.062    1.406    1.548    1.570    1.596
Average β       1.799    1.224    0.446    0.125    0.037    0.012


The second example is a bivariate linear-in-logs multiplicative model with $\alpha = 4$,
$\beta = 0.4$, and additive measurement errors in both variables. In this case the setup is
as follows:

$$
\begin{aligned}
y^* &= 4x^{*0.4}u, & u &\sim \mathcal{N}[10, 0.0001], \\
x^* &= 100 + U[0, 1], \\
y &= y^* + \varepsilon_y, & \varepsilon_y &\sim \mathcal{N}[0, \sigma_y^2], \\
x &= x^* + \varepsilon_x, & \varepsilon_x &\sim \mathcal{N}[0, \sigma_x^2].
\end{aligned}
$$

In the simulation the sample size is 1,000, and the number of replications is 100.
We vary the value of the variance of $x^*$ from experiment to experiment, resulting in
the following values of $\sigma_x^2/\sigma_{x^*}^2$: 0.001, 0.01, 0.1, 1, 5, 10, 50, 100, 1,000, and 5,000.


Table 26.2. Attenuation Bias in a Nonlinear Regression with Additive
Measurement Error

σ_x²/σ_x*²      0.00025  0.0025   0.025    0.25     2.5      25
Average β       0.393    0.383    0.341    0.217    0.063    0.020

The upper row of Table 26.2 gives the average values of slope coefficients across 
different experiments in which the noise-to-signal ratio varies. Once again the attenu- 
ation bias is obvious. 

Both examples produce results that are consistent with the hypothesis underlying 
the “Iron Law of Econometrics.” 
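
A simulation in the spirit of the first example can be sketched as follows. The intercept and slope used here (`alpha = -1`, `beta = 2`) are hypothetical values chosen for illustration and are not claimed to be those behind Table 26.1, so the numbers produced will differ from the table. The logit model is fitted by Newton's method to keep the sketch self-contained.

```python
import numpy as np


def fit_logit(y, x, n_iter=25):
    """Logit MLE of y on an intercept and x by Newton's method."""
    X = np.column_stack([np.ones_like(x), x])
    b = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ b)))
        grad = X.T @ (y - p)
        hess = (X.T * (p * (1.0 - p))) @ X   # negative Hessian of the log-likelihood
        b = b + np.linalg.solve(hess, grad)
    return b


def average_estimates(s, n=1000, reps=100, alpha=-1.0, beta=2.0, seed=0):
    """Average (intercept, slope) over replications for noise-to-signal ratio s."""
    rng = np.random.default_rng(seed)
    sigma_v = np.sqrt(s / 12.0)              # since V[x*] = 1/12 for U[0, 1]
    est = np.empty((reps, 2))
    for r in range(reps):
        x_star = rng.uniform(0.0, 1.0, n)
        y = (alpha + beta * x_star + rng.logistic(size=n) > 0).astype(float)
        x = x_star + rng.normal(0.0, sigma_v, n)
        est[r] = fit_logit(y, x)
    return est.mean(axis=0)


ratios = [0.0, 0.04, 0.25, 1.0, 4.0, 9.0]
results = {s: average_estimates(s) for s in ratios}

for s, (a_bar, b_bar) in results.items():
    print(f"s = {s:5.2f}   average intercept {a_bar:6.3f}   average slope {b_bar:6.3f}")
```

The average slope shrinks toward zero as the noise-to-signal ratio grows, while the average intercept drifts to absorb the lost signal, which is the qualitative pattern reported for the first example.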


26.6. Bibliographic Notes 


Wansbeek and Meijer (2000) is the most up to date and comprehensive work on measure- 
ment errors written from an econometric perspective. It covers in depth most of the topics in 
this chapter, with emphasis on linear models. The authors also include several chapters link- 
ing measurement error models with factor models, latent variable models, and structural equa- 
tion models. In discussing results the authors eschew the phrase “it can be shown” in favor of 
deriving them in detail. Again from the econometric perspective Hausman (2000) provides a
survey of the recent results obtained in his and his collaborators' research. See Bound, Brown,
and Mathiowetz (2001) for a survey of measurement error issues in labor markets.

The topic of measurement errors is well established in the statistics literature. Fuller (1987) 
is a useful reference; see, in particular, his treatment of the orthogonal regression approach 
that is applicable when the noise-to-signal ratio is known. Although our analysis of the linear 
model is very standard in the econometrics literature, the reader should also be aware of the 
alternative Berkson error model, in which the unobserved true variable is assumed constant 
but the imperfectly measured variable is subject to error, and the nonclassical measurement 
error model discussed in Angrist and Krueger (1999). Madansky (1959) provides a survey of 
numerous early results and approaches. See also Stefanski (2000). 


26.2 Panel data models with measurement errors are analyzed in Bjorn (1992). 

26.3 The intriguing topic of reverse regression is analyzed by Goldberger (1984) and Greene 
(1983) in their commentary on Conway and Roberts (1983). Leamer (1978) provides 
an insightful discussion of reverse regression from a Bayesian perspective. Hahn and 
Hausman (2002) use the reverse regression idea to construct a specification test for the 
validity of the IV approach to the measurement error problem. The concern is that the 
available instruments may be weak, leading to poor estimates. The Hahn-Hausman idea
is to carry out IV estimation of the direct regression in which the mismeasured variable 
appears on the right-hand side of the equation. The reverse regression has the same mis- 
measured variable on the left-hand side. This regression is estimated also by instrumental 
variables using the same instrumental variables as the direct regression. 

The literature on measurement errors in nonlinear models is more diffuse. Y. Amemiya 
(1985) is especially useful to econometricians. From a statistical viewpoint, Carroll et al. 
(1995) consider nonlinear models, especially in the generalized linear class, with additive 
measurement errors in regressors, using a variety of methods, including a number that 
can be used if replicated data are available. Li, Trivedi, and Guo (2003) develop and 
apply a measurement error variable model in which the counted response variable has 
measurement error. 


Exercises 


26-1 Consider the attenuation bias result for the slope parameter of the bivariate
errors-in-variables model (Equation (26.9) in Section 26.2.3). Extend the model
to include an intercept term.


(a) Derive a parallel result for the measurement error bias of the intercept term. 
(b) Derive a parallel identification-by-bounds result for the least-squares inter- 
cept estimate, similar to Equation (26.12) in Section 26.3.1. 


26-2 (Adapted from Bollinger, 2003) Consider a linear multiple regression model
with scalar regressor x that is measured with error and a vector of other regressors
z that are free of measurement error.


(a) Maintaining the assumptions regarding measurement errors in the bivari- 
ate errors-in-variables model, extend the attenuation bias result and the 
identification-by-bounds result to this case. 

(b) Check that the new results specialize to those for the bivariate case. 


26-3 (Adapted from Wansbeek and Meijer, 2000) Consider the quadratic regression
model $y = \alpha + \beta x^* + \gamma x^{*2} + \varepsilon$, where the regressor $x^* = x + v$, with $x$ observed
and $v$ a measurement error. Assume that $(x^*, \varepsilon, v)$ are mutually uncorrelated and
normally distributed and that all variables have zero mean.


(a) Compare the bias of the least-squares estimators of $\beta$ and $\gamma$.
(b) Is the model identified? Compare the latter result with that from the bivariate
linear errors-in-variables model.


26-4 The literature on intergenerational mobility uses the following model (Solon,
1992; Zimmerman, 1992):

$$Y_i^{\mathrm{son}} = \alpha + \beta Y_i^{\mathrm{father}} + \varepsilon_i^{\mathrm{son}}, \tag{26.26}$$

with $\varepsilon_i^{\mathrm{son}} \sim$ iid $[0, \sigma_\varepsilon^2]$. Here $Y$ is a measure of permanent status (such as permanent
income) and $\beta$ measures the degree of regression toward the mean in
economic status. Suppose that permanent status is not observed. Instead, current
status $Y_{it}$ is observed with $Y_{it} = Y_i + \gamma X_{it} + w_{it}$, so that $Y_{it}$ is composed
of an individual fixed effect $Y_i$, referred to as the permanent status, a systematic
factor $X_{it}$, and a transitory error component $w_{it}$. Let $\hat\gamma$ denote the fitted
least-squares coefficient, and let

$$Y_{it} - \hat\gamma X_{it} = Y_i + (\gamma - \hat\gamma) X_{it} + w_{it} = Y_i + v_{it}.$$

(a) Let $\bar{Y}_i^{\mathrm{father}} = T^{-1} \sum_{t=1}^{T} Y_{it}^{\mathrm{father}}$ denote an average of father’s status used as
the independent variable, a proxy for the unobserved permanent status in
(26.26). Let $\hat\beta_{\mathrm{avg}}$ denote the corresponding regression coefficient. Show that
plim $\hat\beta_{\mathrm{avg}} = \beta P_Y$, where $P_Y = \sigma_Y^2/(\sigma_Y^2 + T^{-1}\sigma_v^2)$.

(b) Assume that the transitory component of father’s earnings follows an autoregressive
scheme, $v_{it} = \rho v_{i,t-1} + \xi_{it}$, where $\xi_{it} \sim N[0, \sigma_\xi^2]$, $t = 1, \ldots, T$.
Show that now plim $\hat\beta_{\mathrm{avg}} = \beta P_Y^*$, where $P_Y^* = \sigma_Y^2/(\sigma_Y^2 + T^{-1}V)$ and
$V = \sigma_\xi^2 [1/(1 - \rho^2)] [1 + 2\rho\{T - (1 - \rho^T)/(1 - \rho)\}/\{T(1 - \rho)\}]$.

Missing Data and Imputation


27.1. Introduction 


The problem of missing data in survey data is one of long standing, arising from 
nonresponse or partial response to survey questions. Reasons for nonresponse include 
unwillingness to provide the information asked for, difficulty of recall of events that 
occurred in the past, and not knowing the correct response. Imputation is the process 
of estimating or predicting the missing observations. 

In this chapter we deal with the regression setup with data vector (y;, x;), i = 
1,..., N. For some of the observations some elements of x; or of both (y;, x;) are 
missing. A number of questions are considered. When can we proceed with an anal- 
ysis of only the complete observations, and when should we attempt to fill the gaps 
left by the missing observations? What methods of imputation are available? When 
imputed values for missing observations are obtained, how should estimation and in- 
ference then proceed? 

If a data set has missing observations, and if these gaps can be filled by a statistically 
sound procedure, then benefit comes from a larger and possibly more representative 
sample and, under ideal circumstances, more precise inference. The cost of estimating 
missing data comes from having to make (possibly wrong) assumptions to support a 
procedure for generating proxies for the missing observations, and from the approxi- 
mation error inherent in any such procedure. Further, statistical inference that follows 
data augmentation after imputed values replace missing data is more complicated be- 
cause such inference must take into account the approximation errors introduced by 
imputation. 

Gaps in data as the result of survey nonresponse and attrition from panels occur
frequently. Imputation of missing values may be done by the agencies that create and
maintain public-use survey databases or by those who use the data for modeling.
In the former case the agency may have more extensive information, including
confidential information, that can be harnessed in the imputation process. In the latter 
case the modeler may have a specific modeling framework that can be exploited in the 
imputation process. In both cases model-based imputation procedures are feasible. 

Figure 27.1: Missing data: examples of missing regressors. (Panel A: univariate missing data pattern; panel C: general pattern of missing data.)


An interesting example of missing data arises in the context of the Survey of Con- 
sumer Finances (Kennickell, 1998). Because of the sensitivity of the issue of consumer 
finances the survey exhibits many gaps in information on income and wealth. Analysts 
at the U.S. Federal Reserve have developed and implemented complex imputation al- 
gorithms for continuous and discrete variables using both publicly available survey in- 
formation on income and wealth as well as confidential information from census data. 

Figure 27.1 shows some potential patterns of missing data on the regressors. The
data set has a scalar dependent variable y and three regressors x1, x2, and x3 for each
observation, stacked as (y, x1, x2, x3). In panel A, there are complete data on
(y, x2, x3) but a block of observations on x1 is missing. In panel B there are complete
data on (y, x3) but there are missing blocks of data on (x1, x2) such that x1 and x2
are never simultaneously observed. In panel C there are missing observations on all
three regressors, but there is no particular pattern of missingness.

The simplest way of handling missing data is to delete them and analyze only the
reduced sample of “complete” observations. For example, in the case of panel A, the
complete sample would be the subset of (y, x1, x2, x3) formed by all available data on
x1 and the corresponding observations on (y, x2, x3). In the case of panel B, however,
following this approach would leave no usable observations, unless one excluded
(x1, x2) from the analysis. In panel C the complete data set is formed after deleting any
observation that contains a missing data point on any of the three regressors.

The procedure just described is called listwise deletion. It is widely followed and is 
often a default option in statistical software. It is not necessarily innocuous; the conse- 
quences depend on the missing data mechanism, and the conclusions drawn from such 
studies might be seriously flawed. Of course, in general throwing away data means 
throwing away information, and that reduces efficiency in estimation. Hence, provided 
the gaps attributed to missing data can be filled without creating distortion, imputation
seems worth trying. This chapter will study alternative approaches and their
limitations. 

Broadly, there are two approaches to imputation, one that is model-based and one 
that is not. The modern approach prefers model-based approaches. These use a model 
to impute the missing observations and then use the subsequent full data set to obtain 
better estimates of the model parameters. The process is iterative. Single and multiple 
imputation are feasible. A key feature of the modern approach is to regard missing 
data as random variables and then to replace them with multiple draws from the as- 
sumed underlying distribution; the process is called multiple imputation. Simulation 
methods may be used to approximate such a distribution. 

This topic warrants a separate short introductory chapter as imputation is an impor- 
tant aspect of microeconometric work. Survey data inevitably include missing data, 
and the common practice of listwise deletion is an imputation method. Better im- 
putation methods are available. An important caveat, however, is that all imputation 
methods are based on assumptions that in some applications may be too strong. 

Most of the chapter deals with model-based approaches. Section 27.2 provides an 
introduction to the terminology and assumptions that are firmly entrenched in the im- 
putation literature. Section 27.3 gives a brief treatment of missing data methods that 
do not use models. Section 27.4 begins with the first of the model-based methods, 
maximum likelihood. Section 27.5 considers the regression framework and EM-type 
methods of imputation. Sections 27.6 and 27.7 present approaches to imputation us- 
ing the Bayesian concepts of data augmentation and MCMC. Section 27.8 provides 
an illustrative example. Sections 27.6-27.8 provide a nice application of the Bayesian
methods of Chapter 13. 


27.2. Missing Data Assumptions 


Some of the basic terminology and formal definitions widely used in the impu- 
tation literature are due to Rubin (1976), who introduced two key missing data 
mechanisms, missing at random and missing completely at random, that serve as useful
benchmarks. 

Rubin’s setup involves Y, an N x p matrix consisting of a complete data set, 
which may not be fully observed. Denote by $Y_{obs}$ the observed part and by $Y_{mis}$ the
nonobserved (missing) part. In the context of a regression model Y refers to both 
the regressors and the response (dependent) variables. Therefore, the analysis covers 
the general case of missing data. Let R denote an N x p matrix of indicator variables 
whose elements are zero or one depending on whether corresponding values in the Y 
matrix are missing or observed. 

For regression with a single dependent variable, $Y$ contains data on the response variable
$y$ and the $(p - 1)$ regressors $X$. The probability that $x_{ki}$, the $i$th observation on
variable $x_k$, is missing may be (i) independent of its realized value, (ii) dependent on
its realized value, (iii) dependent on $x_{kj}$, $j \neq i$, or (iv) dependent on $x_{lj}$, $j \neq i$, $l \neq k$.

Assumptions about the structure of missingness follow. 


27.2.1. Missing at Random 


Suppose $x_i$ ($i = 1, \ldots, N$) is an observation on a variable in the data set under study.
The missing at random (MAR) assumption is that the “missingness” in $x_i$ does not
depend on its value but may depend on the values of $x_j$ ($j \neq i$). Formally,

$$
\begin{aligned}
x_i \text{ is MAR} \;\Rightarrow\; & \Pr[x_i \text{ is missing} \mid x_i, x_j \ \forall j \neq i] \\
& = \Pr[x_i \text{ is missing} \mid x_j \ \forall j \neq i].
\end{aligned} \tag{27.1}
$$

After controlling for other observations on $x$, the probability of missingness of $x_i$ is
unrelated to the value of $x_i$.

Rubin’s (1976) even more formal definition states the following: The MAR assumption
implies that the probability model for the indicator variable $R$ does not depend on
$Y_{mis}$, that is,

$$\Pr[R \mid Y_{obs}, Y_{mis}, \psi] = \Pr[R \mid Y_{obs}, \psi],$$

where $\psi$ is the underlying (vector) parameter of the missingness mechanism.

Under MAR no nonresponse bias is induced in a likelihood-based inference that 
ignores the missing data mechanism, although the resulting estimates may be in- 
efficient. If the MAR assumption fails, however, the probability of missingness 
depends on the unobserved missing values. The MAR restriction is not testable 
because the values of the missing data are unknown. Because MAR is a strong as- 
sumption, sensitivity analyses based on different assumptions about missingness are 
desirable. 

A separate issue is whether the pattern of missing data is purely random. In prac- 
tice, we might expect that observations missing inside clusters of data, in the sense of 
Chapter 24, may be correlated. However, this issue is not related to that of nonresponse 
bias resulting from the missingness being connected to data values. 


27.2.2. Missing Completely at Random


Missing completely at random (MCAR) is a special case of MAR. It means that $Y_{obs}$
is a simple random sample of all potentially observable data values (Schafer, 1997).

Again suppose $x_i$ is an observation on a variable in the data set under study. Then
the data on $x_i$ are said to be MCAR if the probability of missing data on $x_i$ depends
neither on its own values nor on the values of other variables in the data set.
Formally,

$$
x_i \text{ is MCAR} \;\Rightarrow\; \Pr[x_i \text{ is missing} \mid x_i, x_j \ \forall j \neq i]
= \Pr[x_i \text{ is missing}]. \tag{27.2}
$$


For example, MCAR is violated if (a) those who do not report income are younger, on 
average, than those who do or if (b) typically small (large) values are missing. 

For cases (i)—(iv) mentioned at the outset in this section, case (i) satisfies both 
MCAR and MAR, cases (iii) and (iv) satisfy MAR, and (ii) does not satisfy MAR. 

MCAR implies that the observed data are a random subsample of the potential full 
sample. If the assumption is valid, no bias results from ignoring incomplete
observations, that is, observations with missing values. 

The corollary is that the failure of MCAR implies a sample selection type of bias. 
MAR is a weaker assumption that still aids imputation as it assumes that the missing 
data mechanism depends only on observed quantities. 
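
The distinction between the mechanisms can be made concrete with a small simulation. The following is a minimal sketch under assumptions that are not in the text: a variable `x` (say, income) with true mean 1 that is subject to missingness, an always-observed variable `z` (say, age), and three hypothetical missingness rules. It compares the mean of the observed `x` with a regression-adjusted mean that uses `z`, which is valid only when missingness depends on observed quantities alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
z = rng.normal(size=n)                          # always observed (e.g., age)
x = 1.0 + 0.8 * z + rng.normal(size=n)          # subject to missingness; true mean is 1

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

# Three hypothetical missingness mechanisms for x.
miss_mcar = rng.random(n) < 0.3                 # depends on neither x nor z
miss_mar = rng.random(n) < logistic(z)          # depends only on the observed z
miss_nmar = rng.random(n) < logistic(x - 1.0)   # depends on the value of x itself

def observed_mean(missing):
    """Mean of x over the cases where x is observed."""
    return x[~missing].mean()

def regression_adjusted_mean(missing):
    """Fit x on (1, z) using complete cases, then average fitted values over all cases."""
    Z = np.column_stack([np.ones(n), z])
    coef, *_ = np.linalg.lstsq(Z[~missing], x[~missing], rcond=None)
    return (Z @ coef).mean()

results = {
    name: (observed_mean(m), regression_adjusted_mean(m))
    for name, m in [("MCAR", miss_mcar), ("MAR", miss_mar), ("not MAR", miss_nmar)]
}
for name, (raw, adj) in results.items():
    print(f"{name:8s} observed mean {raw:.3f}   regression-adjusted mean {adj:.3f}")
```

In this design the observed mean is close to 1 only under MCAR; under MAR the adjustment based on `z` restores it, and when missingness depends on `x` itself neither calculation does.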


27.2.3. Ignorable and Nonignorable Missingness 


A missing data mechanism is said to be ignorable if (a) the data set is MAR and (b) the
parameters for the missing data-generating process, $\psi$, are unrelated to the parameters
$\theta$ that we want to estimate.

This condition, which is similar to that of weak exogeneity discussed in Chapter 2,
implies that the parameters $\theta$ of the model are distinct from the parameters $\psi$ of the
missingness mechanism. Thus, if the missing data are ignorable, then there is no need to
model the dgp for missing data as an essential part of the modeling exercise. MAR and
“ignorability” are often treated as equivalent under the assumption that condition (b)
for ignorability is almost always satisfied (Allison, 2002).

A nonignorable missing data mechanism arises if the MAR assumption is violated
for (y, x), but not if MAR is violated only for x. In the nonignorable case
the dgp for missing data must be modeled along with the overall model to obtain
consistent estimates of the parameters $\theta$. To avoid the possibility of selection bias,
estimators such as Heckman’s two-stage procedure (see Chapter 16) must be used.

The imputation literature focuses on ignorable missingness. If additionally the data 
set is MCAR then missing data cause no problem, aside from efficiency loss that might 
be reduced by imputation. If instead the data set is only MAR then imputation methods 
may be needed to ensure consistency, as well as to increase efficiency. 


27.3. Handling Missing Data without Models


If no models are to be used, then one can simply analyze the available data or one can 
analyze data after non-model-based imputation. 


27.3.1. Using Available Data Only 


Listwise deletion or complete case analysis means the deletion of the observations 
(cases) that have missing values on one or more of the variables in the data set. Under 
the MCAR assumption, the remaining sample after listwise deletion remains a random 
sample from the original population; therefore the estimates based on it are consistent. 
However, the standard errors will be inflated because less information is used. If the 
number of regressors is large, then the total effect of listwise deletion can lead to very 
substantial reduction in the total number of observations. This might encourage one to 
leave out of the analysis variables with a high proportion of missing observations, but 
the results generated by such practice are potentially misleading. 

If MCAR is not satisfied and the missing data are only MAR, then the estimates will 
be biased. Thus listwise deletion is not robust to the violations of MCAR. However, 
listwise deletion is robust to the violations of MAR among the independent variables 
(regressors) in regression analysis, that is, when the probability of missing data on any 
regressor does not depend on the values of the dependent variable. Briefly, listwise 
deletion is acceptable if incomplete cases attributable to missing data comprise a small 
percentage, say 5% or less, of the number of total cases (Schafer, 1996). It is important 
that the sample after listwise deletion is representative of the population under study. 
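
The robustness claim for regressors can be checked numerically. The sketch below assumes a hypothetical model $y = 1 + 2x + u$ and two deterministic missingness rules chosen only for illustration: in the first, `x` is missing whenever `x` itself is large; in the second, `x` is missing whenever the dependent variable is large.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

def ols_complete_cases(x_missing):
    """OLS of y on (1, x) after listwise deletion of rows where x is missing."""
    keep = ~x_missing
    X = np.column_stack([np.ones(keep.sum()), x[keep]])
    coef, *_ = np.linalg.lstsq(X, y[keep], rcond=None)
    return coef   # [intercept, slope]

miss_on_x = x > 0.5   # missingness depends on the regressor's own value
miss_on_y = y > 2.0   # missingness depends on the dependent variable

coef_x = ols_complete_cases(miss_on_x)
coef_y = ols_complete_cases(miss_on_y)
print("missingness driven by x:", coef_x.round(3))
print("missingness driven by y:", coef_y.round(3))
```

Selection on the regressor leaves the complete-case coefficients close to (1, 2), whereas selection on the dependent variable attenuates the slope.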

Pairwise deletion or available-case analysis is often considered a better method 
than listwise deletion. The idea here is to use all possible pairs of observations $(x_{1i}, x_{2i})$
in estimating joint sample moments of (x1, x2) and to use all observations on an indi- 
vidual variable in estimating marginal moments. Thus, in a linear regression, under 
pairwise deletion we would estimate (X’X) and (X’y) using all possible pairs of re- 
gressors, whereas under listwise deletion we would estimate the same after deleting 
all cases with any missing observations. It is clear that we lose less information un- 
der pairwise deletion. The proposal here is to use maximum information to estimate 
individual summary statistics such as means and covariances and then to use these 
summary statistics to compute the regression estimates. 

There are two important limitations of pairwise deletion: (1) Conventionally es- 
timated standard errors and test statistics are biased and (2) the resulting regressor 
covariance matrix (X’X) may not be positive definite. 
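
The second limitation is easy to exhibit. The following sketch uses a small artificial data set (an assumption made purely for illustration) in which each block of rows observes only one pair of the three variables, and a hypothetical helper `pairwise_cov` that computes each entry from the rows where both variables are observed.

```python
import numpy as np

nan = np.nan
seq = [1.0, 2.0, 3.0, 4.0]
# Columns are x1, x2, x3; each block of four rows observes only one pair.
data = np.array(
    [[a, a, nan] for a in seq]            # x1 and x2 move together, x3 missing
    + [[a, nan, a] for a in seq]          # x1 and x3 move together, x2 missing
    + [[nan, a, 5.0 - a] for a in seq]    # x2 and x3 move in opposite directions
)

def pairwise_cov(data):
    """Covariance matrix using, for each pair, all rows where both are observed."""
    k = data.shape[1]
    out = np.empty((k, k))
    for i in range(k):
        for j in range(k):
            both = ~np.isnan(data[:, i]) & ~np.isnan(data[:, j])
            out[i, j] = np.cov(data[both, i], data[both, j])[0, 1]
    return out

S = pairwise_cov(data)
eigenvalues = np.linalg.eigvalsh(S)
print(S.round(3))
print("eigenvalues:", eigenvalues.round(3))   # a negative eigenvalue: S is not positive definite
```

No row of this data set is complete, so listwise deletion would leave nothing to analyze, while pairwise deletion produces a "covariance matrix" with a negative eigenvalue.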


27.3.2. Imputation without Models 


There are a number of ad hoc or weakly justified procedures often implemented in 
statistical software. 

Mean imputation or mean substitution involves replacing missing observations
by the average of the available values. It is mean-preserving but distorts the
marginal distribution of the data: the probability mass in the center
of the marginal distribution increases. It also affects the covariances and correlations
with other variables.

Simple hot deck imputation involves replacement of the missing value by a ran- 
domly drawn value from the available observed values of that variable, somewhat like 
a bootstrap procedure. It preserves the marginal distribution of the variable, but it dis- 
torts the covariances and correlations between variables. 

In a regression setting neither of these two well-known approaches is attractive
despite their simplicity.
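
The distortions described above can be seen in a short simulation. This is a minimal sketch under illustrative assumptions: two standard normal variables `x` and `z` with correlation 0.8, and 40% of the values of `x` missing completely at random.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
z = rng.normal(size=n)
x = 0.8 * z + 0.6 * rng.normal(size=n)     # var(x) = 1 and corr(x, z) = 0.8 by construction
missing = rng.random(n) < 0.4              # MCAR missingness in x

x_mean = x.copy()
x_mean[missing] = x[~missing].mean()       # mean imputation

x_hot = x.copy()
x_hot[missing] = rng.choice(x[~missing], size=missing.sum(), replace=True)  # simple hot deck

def summarize(v):
    """Return (variance of v, correlation of v with z)."""
    return v.var(), np.corrcoef(v, z)[0, 1]

summary = {
    "complete data": summarize(x),
    "mean imputation": summarize(x_mean),
    "hot deck": summarize(x_hot),
}
for name, (var, corr) in summary.items():
    print(f"{name:16s} variance {var:.3f}   corr with z {corr:.3f}")
```

Mean imputation shrinks the variance of `x`, the hot deck keeps it roughly intact, and both weaken the correlation with `z`, which is why neither is attractive when the goal is regression.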


27.4. Observed-Data Likelihood 


The modern approach to missing data is to impute values for missing observations by 
making single or multiple draws from the estimated distribution based on the pos- 
tulated observed data model and the model for the missing data mechanism. The 
Bayesian variants of this procedure make the draws from the posterior distribution, 
which uses both the likelihood and the prior distribution of the parameters. 

The first important issue involves the role played by the missing data mechanism 
in the imputation procedure and especially whether the missing data mechanism is 
ignorable. 

Let $\theta$ denote the parameters of the dgp for $Y = (Y_{obs}, Y_{mis})$ and let $\psi$ denote the
parameters of the missing data mechanism. For convenience of notation it is assumed
that $(Y_{obs}, Y_{mis})$ are continuous variables. Then the joint distribution of $(R, Y_{obs})$ is
given by

$$
\begin{aligned}
\Pr[R, Y_{obs} \mid \theta, \psi] &= \int \Pr[R, Y_{obs}, Y_{mis} \mid \theta, \psi] \, dY_{mis} \\
&= \int \Pr[R \mid Y_{obs}, Y_{mis}, \psi] \Pr[Y_{obs}, Y_{mis} \mid \theta] \, dY_{mis} \\
&= \Pr[R \mid Y_{obs}, \psi] \int \Pr[Y_{obs}, Y_{mis} \mid \theta] \, dY_{mis} \\
&= \Pr[R \mid Y_{obs}, \psi] \Pr[Y_{obs} \mid \theta].
\end{aligned} \tag{27.3}
$$

The first equality derives the joint probability of $(R, Y_{obs})$ by integrating out (or averaging
over) $Y_{mis}$ from the joint probability of all data and $R$. The second line factors
the joint probability into conditional and marginal components, the conditioning being
with respect to $Y_{obs}$ and $Y_{mis}$. The third line separates the missing data mechanism
from the observed data mechanism; this step is justified by the MAR assumption. The
last line shows that, with $\theta$ and $\psi$ distinct parameters, inference about $\theta$ can
ignore the missing data mechanism and depends on $Y_{obs}$ alone.
The observed-data likelihood is proportional to the last factor in the fourth line:

$$L[\theta \mid Y_{obs}] \propto \Pr[Y_{obs} \mid \theta]. \tag{27.4}$$

It involves only the observed data $Y_{obs}$ even though the parameters $\theta$ appear in the
dgp for all observations (observed and missing). As in Chapter 13, the constant of
proportionality does not appear in (27.4).
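
A worked special case may help fix ideas. It rests on assumptions that are not in the text: scalar observations $y_i$ that are iid $N[\mu, \sigma^2]$, so that $\theta = (\mu, \sigma^2)$; a response indicator with $R_i = 1$ when $y_i$ is observed; and each $R_i$ an independent Bernoulli draw with probability $\psi$ that is unrelated to $y_i$ (an MCAR mechanism, and hence also MAR). With $N_{obs}$ observed values, the factorization in (27.3) and the observed-data likelihood in (27.4) then take the form

```latex
\Pr[R, Y_{obs} \mid \theta, \psi]
  = \underbrace{\psi^{N_{obs}} (1-\psi)^{N-N_{obs}}}_{\Pr[R \mid Y_{obs},\, \psi]}
    \times
    \underbrace{\prod_{i:\,R_i=1} \frac{1}{\sqrt{2\pi\sigma^2}}
      \exp\!\left\{-\frac{(y_i-\mu)^2}{2\sigma^2}\right\}}_{\Pr[Y_{obs} \mid \theta]},
\qquad
L[\theta \mid Y_{obs}] \propto \prod_{i:\,R_i=1} \frac{1}{\sigma}
      \exp\!\left\{-\frac{(y_i-\mu)^2}{2\sigma^2}\right\}.
```

Maximizing the second expression yields the sample mean and (maximum likelihood) variance of the observed values only; the parameter $\psi$ plays no role in inference about $\theta$.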




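Under the MAR assumption the joint posterior probability of $(\theta, \psi)$ is written as
the product of $\Pr[R, Y_{obs} \mid \theta, \psi]$ and the joint prior distribution $\pi(\theta, \psi)$ as follows:

$$
\begin{aligned}
\Pr[\theta, \psi \mid Y_{obs}, R] &= k \Pr[R, Y_{obs} \mid \theta, \psi] \, \pi(\theta, \psi) \\
&\propto \Pr[R \mid Y_{obs}, \psi] \Pr[Y_{obs} \mid \theta] \, \pi(\theta, \psi) \\
&\propto \Pr[R \mid Y_{obs}, \psi] \Pr[Y_{obs} \mid \theta] \, \pi_\theta(\theta) \, \pi_\psi(\psi),
\end{aligned} \tag{27.5}
$$

where $k$ in the first line is a constant of proportionality free of $(\theta, \psi)$. The second
line uses the factorization given in (27.3), and the third line uses the assumption of
independent priors for $\theta$ and $\psi$.

As our main interest is in $\theta$, we derive the marginal posterior for $\theta$ by integrating
out $\psi$ from the joint posterior. This yields the observed-data posterior

$$
\begin{aligned}
\Pr[\theta \mid Y_{obs}, R] &= \int \Pr[\theta, \psi \mid Y_{obs}, R] \, d\psi \\
&\propto \Pr[Y_{obs} \mid \theta] \, \pi_\theta(\theta) \int \Pr[R \mid Y_{obs}, \psi] \, \pi_\psi(\psi) \, d\psi \\
&\propto L[\theta \mid Y_{obs}] \, \pi_\theta(\theta),
\end{aligned} \tag{27.6}
$$

where the second line separates $\theta$ and $\psi$, and the last line absorbs the integral expression
into the constant of proportionality. Therefore, the last line does not involve $\psi$
and is independent of the missing data mechanism $R$.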


27.5. Regression-Based Imputation 


In this section we consider a least-squares based imputation. The key component is 
use of the EM algorithm, previously introduced and discussed in Section 10.3.7. 

The EM algorithm consists of the expectation step and the maximization step. The 
structure of the EM algorithm is closely related to Bayesian MCMC and data aug- 
mentation methods. Therefore, rather than providing a fully operational method for 
handling missing data, we will introduce an example that brings out the motivation be- 
hind modern multiple imputation techniques and suggests the major features of such 
an approach. 


27.5.1. Linear Regression Example with Missing Data 
on a Dependent Variable 


In practice one can have missing observations on dependent (endogenous) variables
and/or explanatory variables. We consider a regression example that has missing data
on the dependent variable, with

$$
\begin{bmatrix} y_1 \\ y_{mis} \end{bmatrix}
= \begin{bmatrix} X_1 \\ X_2 \end{bmatrix} \beta
+ \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}, \tag{27.7}
$$

where $E[u \mid X] = 0$ and $E[uu' \mid X] = \sigma^2 I_N$. The complication is that a block of observations
on the dependent variable $y$, denoted $y_{mis}$, is missing. We assume that the
available complete observations are a random sample from the population, so that the
missing data are assumed to be MAR though not MCAR.

Given the MAR assumption and $N_1 > K$, the first block of $N_1$ observations can be
used to consistently estimate the $K$-dimensional parameter $\beta$ and $\sigma^2$. The maximum
likelihood estimates of $(\beta, \sigma^2)$ under Gaussian errors are $\hat\beta = [X_1'X_1]^{-1}X_1'y_1$ and
$s^2 = (y_1 - X_1\hat\beta)'(y_1 - X_1\hat\beta)/N_1$. By standard theory, and under the normality assumption,
$\hat\beta \mid \text{data} \sim N[\beta, \sigma^2 [X_1'X_1]^{-1}]$ and $N_1 s^2/\sigma^2 \mid \hat\beta \sim \chi^2_{N_1-K}$.

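First, consider a naive single-imputation procedure for generating the missing observations.
Conditional on $X_2$, the predicted values of $y_{mis}$, denoted $\hat y_{mis}$, are given by
$X_2\hat\beta$, where $\hat\beta$ is the preceding estimate obtained using only the first $N_1$ observations.
Then

$$
\begin{aligned}
\widehat{E}[y_{mis} \mid X_2] &= \hat y_{mis} = X_2\hat\beta, \\
\widehat{V}[\hat y_{mis}] &= \widehat{V}[\hat y_{mis} \mid X_2] = s^2 \left( I_{N_2} + X_2 [X_1'X_1]^{-1} X_2' \right),
\end{aligned} \tag{27.8}
$$

where $s^2 I_{N_2}$ is an estimate of $V[u_2]$.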

In the naive method one would generate the N2 predicted values of Ymis, and then 
apply standard regression methods to the full sample of N = N; + N2 observations. 

The two steps in the naive method correspond to the two steps of the EM algorithm. 
The prediction step is the E-step, and the second-step application of least squares to 
the augmented sample is the M-step. 

However, this solution has flaws. First, consider the data augmentation step. Because
the generated values $\hat y_{mis}$ lie exactly on the least-squares fitted plane, the addition
of $(\hat y_{mis}, X_2)$ to the sample to produce a new estimate, $\hat\beta_a$, does not change the
previous estimate $\hat\beta$:

$$
\begin{aligned}
\hat\beta_a &= [X_1'X_1 + X_2'X_2]^{-1} [X_1'y_1 + X_2'\hat y_{mis}] \\
&= [X_1'X_1 + X_2'X_2]^{-1} [X_1'X_1\hat\beta + X_2'X_2\hat\beta] \\
&= \hat\beta.
\end{aligned}
$$

Second, the estimate of $\sigma^2$ obtained by applying the standard formula to the residuals from
the augmented sample is too small because the additional $N_2$
residuals are zero by construction,

$$
\begin{aligned}
s_a^2 &= (y - X\hat\beta_a)'(y - X\hat\beta_a)/N \\
&= (y_1 - X_1\hat\beta)'(y_1 - X_1\hat\beta)/N < s^2,
\end{aligned} \tag{27.9}
$$

where $s^2$ correctly divides by $N_1$ rather than $N$.

Finally, as can be seen from the expression for the sampling variance of $\hat y_{mis}$, the
generated predictions are heteroskedastic, unlike the $y_1$, and hence the variance of $\hat\beta_a$
cannot be estimated using the least-squares formula in the usual way. The observations
$\hat y_{mis}$ are draws from a distribution with a different variance. The naive method does not
make allowance for the uncertainty attached to the estimates of $y_{mis}$.
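
The first two flaws can be verified numerically. The sketch below uses simulated data with arbitrary, assumed parameter values; the names `beta_aug` and `s2_aug` are illustrative labels for the augmented-sample estimates, not notation from the text.

```python
import numpy as np

rng = np.random.default_rng(3)
n1, n2, k = 150, 50, 3
X = np.column_stack([np.ones(n1 + n2), rng.normal(size=(n1 + n2, k - 1))])
X1, X2 = X[:n1], X[n1:]
y1 = X1 @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n1)   # only this block of y is observed

def ols(X, y):
    """Least-squares coefficients from the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ y)

beta_hat = ols(X1, y1)
s2 = np.sum((y1 - X1 @ beta_hat) ** 2) / n1                  # divides by N1

y_fill = X2 @ beta_hat                                       # naive predictions for y_mis
y_aug = np.concatenate([y1, y_fill])
beta_aug = ols(X, y_aug)
s2_aug = np.sum((y_aug - X @ beta_aug) ** 2) / (n1 + n2)     # divides by N

print("estimate unchanged:", np.allclose(beta_aug, beta_hat))
print("s2 =", round(s2, 4), " augmented s2 =", round(s2_aug, 4))
```

The augmented estimate coincides with the original one, and the augmented variance estimate equals the original multiplied by $N_1/N$, so the added observations contribute no information while making the fit look more precise.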

To fix these problems modifications are needed. First, the estimation of $y_{mis}$ should
take account of uncertainty regarding $\hat\beta$. This may be done by adjusting $\hat y_{mis}$ and
adding some “noise” to the generated predictions such that the estimates of missing
data more closely mimic a draw from the (estimated or conditional) distribution of
$y_1$. A standardization step can use the fact that an estimate of $V[\hat y_{mis}]$, $\widehat{V}$, is available
from (27.8). Hence the components of the transformed variable $\widehat{V}^{-1/2}\hat y_{mis}$ have
unit variance. To mimic the distribution of $y_1$, we can make a Monte Carlo draw from the
$N[0, s^2]$ distribution and multiply it by $\widehat{V}^{-1/2}\hat y_{mis}$.

The revised algorithm is as follows. 


1. Estimate $\beta$ using the $N_1$ complete observations as before.
2. Generate $\hat y_{mis} = X_2\hat\beta$.
3. Generate adjusted values $\hat y^{a}_{mis} = (\widehat{V}^{-1/2}\hat y_{mis}) \odot u_m$ of $\hat y_{mis}$, where $u_m$ is a Monte Carlo
draw from the $N[0, s^2]$ distribution and $\odot$ denotes element-by-element multiplication.
4. Using the augmented sample obtain a revised estimate of $\beta$.
5. Repeat steps 1-4 where in step 1 the revised estimate of $\beta$ is used.


The revised algorithm, also an EM-type algorithm, continues until it converges in 
the sense that the changes in the coefficients or the changes in regression residual sum 
of squares become arbitrarily small. 

To make connection with later discussion we give the algorithm a different interpretation.
Step 3 is a draw from the conditional distribution of $y$ given $\beta$, and step 4 is a
draw from the conditional distribution of $\beta$ given $s^2$, $X$. The approach may be refined
further by adding a step that involves a draw from the distribution of $s^2$. We do not
go through all the steps of this approach because they will become clearer in our later
discussion of imputation.

Alternative models for missing data on the dependent variable were presented in
Chapter 16. These relaxed the MAR assumption and specified nonignorable missingness.
Then the preceding EM algorithm leads to inconsistent estimation of $\beta$. The censored
Tobit model specifies that data are missing for observations with $x'\beta + u < 0$
and a consistent estimator is the Tobit MLE (see Section 16.3). Amemiya (1985,
pp. 376-378) details the EM algorithm for the Tobit model.


27.6. Data Augmentation and MCMC 


The general structure of the Bayesian approach to missing data is to use the following
type of iterative algorithm, which alternates imputation and prediction steps.

The imputation step (I-step) makes a draw from the conditional predictive distribution
of $Y_{mis}$. Given an $r$th round estimate $\theta^{(r)}$,

$$Y_{mis}^{(r+1)} \sim \Pr[Y_{mis} \mid Y_{obs}, \theta^{(r)}]. \tag{27.10}$$

This expression denotes a random draw of $Y_{mis}^{(r+1)}$ from the predictive conditional
distribution of $Y_{mis}$ given the current estimate $\theta^{(r)}$ and the observed data $Y_{obs}$. Notice that
$Y_{mis}$ is in general a matrix so that this notation refers to (in principle) a series of draws.

The prediction step (P-step) is executed by making a draw from the complete data
posterior

$$\theta^{(r+1)} \sim \Pr[\theta \mid Y_{obs}, Y_{mis}^{(r+1)}]. \tag{27.11}$$

That is, $Y_{obs}$ is augmented by an imputed value $Y_{mis}^{(r+1)}$ drawn from the predictive
distribution of $Y_{mis}$, and a draw is made from the posterior distribution of $\theta$. The steps
(27.10) and (27.11) can then be repeated.

Sequential sampling from the two distributions generates a Markov chain. This process,
which strongly resembles the EM algorithm, is essentially the Gibbs sampler of
Section 13.5.2, but in the missing data literature it is referred to as data augmentation.
Under appropriate conditions, and by a theorem cited in Section 13.5.1, the sequential
draws will converge to a stationary distribution for a sufficiently large value of $r$,
which is the length of the chain. When the chain is terminated we have one imputation
of $Y_{mis}$. Then we can regard $\theta^{(r)}$ as an approximate draw from $\Pr[\theta \mid Y_{obs}]$ and $Y_{mis}^{(r)}$
as an approximate draw from $\Pr[Y_{mis} \mid Y_{obs}]$. As with any MCMC application the chain
has to run sufficiently long to ensure that successive imputations are free of statistical
dependence. These issues have been discussed in Chapter 13.

After convergence we would have accomplished the joint objectives of imputing
the missing values based on the model specified for the data and estimating the model
using both observed and imputed values. Postconvergence we would have the data
necessary to compute the posterior moments of $\theta$ and any interesting functions of $\theta$
and $Y$ using the ideas discussed in Chapter 13.

As a specific illustration of this procedure we reconsider the missing data re- 
gression example of the previous section. The steps in the MCMC algorithm are 
as follows: 


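1. Using observed data calculate $\hat{\beta} = [X_1'X_1]^{-1}X_1'y_1$ and $\hat{u} = y_1 - X_1\hat{\beta}$.

2. Generate $\sigma^2$ as $\hat{u}'\hat{u}$ divided by a draw from the $\chi^2_{N_1-K}$ distribution.

3. Draw $\beta \mid \sigma^2 \sim N[\hat{\beta}, \sigma^2[X_1'X_1]^{-1}]$.

4. Draw $y_{mis} \sim N[X_2\beta, \sigma^2]$.

5. Using $y$ instead of $y_1$, and $X$ instead of $X_1$, repeat steps 1-4 after appropriate
   adjustments.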
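
The five steps above can be sketched in code. The following is a minimal illustration only, under simplifying assumptions that are not in the text: a single regressor with no intercept (so $K = 1$ and all matrix expressions reduce to scalars), missing values only on the dependent variable, and the hypothetical function name `data_augmentation`. The chi-square draw is obtained as a gamma draw with shape equal to half the degrees of freedom and scale 2.

```python
import math
import random


def data_augmentation(y1, x1, x2, n_iter=200, seed=0):
    """Sketch of the five-step sampler for y = x*beta + u with scalar x (K = 1).

    y1, x1 : observed outcomes and their regressors (complete cases)
    x2     : regressors for the cases whose outcome is missing
    Returns the final draw (beta, sigma2, y_mis).
    """
    rng = random.Random(seed)
    k = 1                                   # one regressor, no intercept (assumption)
    y, x = list(y1), list(x1)               # the first pass uses observed data only
    for _ in range(n_iter):
        # Step 1: OLS estimate and residual sum of squares on the current data.
        sxx = sum(xi * xi for xi in x)
        beta_hat = sum(xi * yi for xi, yi in zip(x, y)) / sxx
        rss = sum((yi - xi * beta_hat) ** 2 for xi, yi in zip(x, y))
        # Step 2: sigma^2 = RSS / chi-square draw with len(y) - K degrees of freedom.
        chi2_draw = rng.gammavariate((len(y) - k) / 2.0, 2.0)
        sigma2 = rss / chi2_draw
        # Step 3: beta | sigma^2 ~ N[beta_hat, sigma^2 / sum(x^2)].
        beta = rng.gauss(beta_hat, math.sqrt(sigma2 / sxx))
        # Step 4: impute y_mis ~ N[x2 * beta, sigma^2].
        y_mis = [rng.gauss(xi * beta, math.sqrt(sigma2)) for xi in x2]
        # Step 5: augment the data and repeat steps 1-4 on the completed sample.
        y, x = list(y1) + y_mis, list(x1) + list(x2)
    return beta, sigma2, y_mis
```

In the first pass the degrees of freedom are $N_1 - K$; once the imputed values are appended they become $N - K$, which is the adjustment referred to in step 5.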




The justification for step 2 is that, under an uninformative prior for $(\beta, \sigma^2)$, the
conditional posterior distribution of $\hat{u}'\hat{u}/\sigma^2$ is $\chi^2_{N_1-K}$ if only the observed data are used.
After data augmentation this changes to $\chi^2_{N-K}$. The justification for step 3 is that, under
an uninformative prior, the conditional posterior distribution of $\beta$ is $N[\hat{\beta}, \sigma^2[X_1'X_1]^{-1}]$.
After data augmentation this changes to $N[\hat{\beta}, \sigma^2[X'X]^{-1}]$. Step 4 is the imputation
step using the conditional predictive density $N[X_2\beta, \sigma^2]$. These steps can
be appropriately modified if we use, for example, an informative normal-gamma
prior for $(\beta, \sigma^2)$. The conditional posterior distributions for this case are given in
Section 13.3.


27.7. Multiple Imputation


The analysis of the preceding section explains how a full MCMC run will generate 
a single imputation. However, a single imputation does not adequately handle the 
missing-data uncertainty. This is the essential rationale for using a multiple imputation
procedure. The predictive distribution of $Y_{mis}$ given $Y_{obs}$ is obtained by averaging the
conditional predictive distribution of $Y_{mis} \mid Y_{obs}, \theta$ over the observed-data posterior of $\theta$:


$$\Pr[Y_{mis} \mid Y_{obs}] = \int \Pr[Y_{mis} \mid Y_{obs}, \theta] \Pr[\theta \mid Y_{obs}] \, d\theta.$$


Proper multiple imputations from a Bayesian viewpoint reflect uncertainty about $Y_{mis}$,
given the uncertainty about parameters of the model.

After multiple imputation the missing data $Y_{mis}$ are replaced by simulated/imputed
values $Y_{mis}^{(1)}, Y_{mis}^{(2)}, Y_{mis}^{(3)}, \ldots, Y_{mis}^{(m)}$. Each of the $m$ completed data sets is then analyzed as
if it were complete. The results from the $m$ analyses will show variation that reflects
the uncertainty resulting from the missing data. With m different data sets questions 
arise about how one should determine an appropriate value for m and how the m 
sets of parameter estimates and covariance matrices should be combined. We address 
both of these questions using results from the literature but without providing detailed 
justification. 

In considering how to combine the results based on multiply imputed data the key 
result, stated for an arbitrary statistic Q, is 


$$\Pr[Q \mid Y_{obs}] = \int \Pr[Q \mid Y_{mis}, Y_{obs}] \Pr[Y_{mis} \mid Y_{obs}] \, dY_{mis}, \tag{27.12}$$


which states that the actual posterior distribution of $Q$ is obtained by averaging over
the complete-data posterior distribution of $Q$. This means averaging over the results
of multiple imputations of missing observations (Rubin, 1996).

Equation (27.12) implies that the final estimate of Q is given by the law of iterated 
expectations, 


$$E[Q \mid Y_{obs}] = E[E[Q \mid Y_{obs}, Y_{mis}] \mid Y_{obs}]. \tag{27.13}$$


The posterior mean of $Q$ is the average of $Q$ computed from the completed data after
repeated imputation of missing data.

The final variance of $Q$ is given by the formula


$$V[Q \mid Y_{obs}] = E[V[Q \mid Y_{obs}, Y_{mis}] \mid Y_{obs}] + V[E[Q \mid Y_{obs}, Y_{mis}] \mid Y_{obs}], \tag{27.14}$$


using the variance decomposition formula given in Section A.8. 

Rubin (1996) also gives the following rules for combining moment information,
stated in terms of a scalar parameter. For an arbitrary scalar parameter, suppose $\hat{Q}_r$ is
a point estimate at the $r$th imputation and $U_r$ is a variance estimate. Then define the
averages of the point and variance estimates, respectively, as

$$\bar{Q} = m^{-1} \sum_{r=1}^{m} \hat{Q}_r, \tag{27.15}$$

$$\bar{U} = m^{-1} \sum_{r=1}^{m} U_r, \tag{27.16}$$

and the between-imputation variance as

$$B = (m-1)^{-1} \sum_{r=1}^{m} (\hat{Q}_r - \bar{Q})^2, \tag{27.17}$$

and the total variance as

$$T = \bar{U} + (1 + m^{-1}) B. \tag{27.18}$$


Table 27.1. Relative Efficiency of Multiple Imputation

Number of                Observations Missing ($\lambda$)
Imputations ($m$)        10%        30%        50%
 3                       0.967      0.909      0.857
10                       0.990      0.970      0.952
20                       0.995      0.985      0.975


The results (27.15, 27.16) follow from (27.13); Equation (27.18) follows from 
(27.14). Schafer (1997) gives results for combining p-values and likelihood ratio 
statistics and provides additional references. 
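
The combining rules (27.15)-(27.18) are simple enough to compute directly. The sketch below is illustrative only; the function name `combine_imputations` and its argument names are hypothetical, and the inputs are assumed to be the $m$ point estimates and the $m$ variance estimates of a scalar parameter.

```python
def combine_imputations(q_hats, u_hats):
    """Combine m point estimates and m variance estimates of a scalar parameter.

    Returns (q_bar, u_bar, b, t) as defined in (27.15)-(27.18).
    """
    m = len(q_hats)
    if m < 2 or len(u_hats) != m:
        raise ValueError("need at least two imputations and matching lengths")
    q_bar = sum(q_hats) / m                               # (27.15)
    u_bar = sum(u_hats) / m                               # (27.16)
    b = sum((q - q_bar) ** 2 for q in q_hats) / (m - 1)   # (27.17)
    t = u_bar + (1.0 + 1.0 / m) * b                       # (27.18)
    return q_bar, u_bar, b, t


if __name__ == "__main__":
    # Three imputations with made-up estimates, purely for illustration.
    print(combine_imputations([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))
```

The square root of the total variance then serves as the standard error of the combined point estimate.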

Postimputation inference regarding individual coefficients or subsets of coefficients 
can be carried out using the final estimates, since the standard central limit theorem 
and the associated large-sample results can be extended to cover this case. 

The following is a measure of the relative efficiency of m multiple imputations: 


$$\text{reff} = (1 + \lambda/m)^{-1}, \tag{27.19}$$


where $\lambda$ is the fraction of missing observations. Efficiency is measured relative to no
missing data. The arithmetical calculations in Table 27.1 show that with as few as three 
imputations the efficiency can be as high as 97% with 10% missing data, and 86% 
with 50% missing data. With 10 or more imputations the relative efficiency exceeds 
95% with 50% missing data. Thus, as emphasized by Schafer (1997), the number of 
imputations need not be very high. 
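
The entries of Table 27.1 can be recomputed from (27.19). The short sketch below uses a hypothetical helper name, `relative_efficiency`; its output agrees with the table up to rounding in the third decimal place.

```python
def relative_efficiency(lam, m):
    """Relative efficiency (1 + lambda/m)^(-1) of m imputations, as in (27.19)."""
    return 1.0 / (1.0 + lam / m)


if __name__ == "__main__":
    for m in (3, 10, 20):
        row = [round(relative_efficiency(lam, m), 3) for lam in (0.10, 0.30, 0.50)]
        print(m, row)
```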


27.8. Missing Data MCMC Imputation Example 


This section gives two illustrative applications of missing data imputation that compare
the model-free methods of listwise deletion and mean imputation (see Section 27.2) with
the model-based method of data augmentation using the MCMC algorithm (see
Section 27.6). Only data on regressors are missing and the missing data mechanism is MAR.

The first application involves simple multiple regression, and the second involves 
a logit regression. For clarity and simplicity we use artificially generated data with a 
known dgp. 


27.8.1. Linear Regression with Missing Data on Regressors 


For the linear regression example the dgp is 
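$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + u_i, \quad i = 1, 2, \ldots, N, \tag{27.20}$$


with $u_i \mid x_{1i}, x_{2i} \sim N[0, \sigma^2]$ and $(x_{1i}, x_{2i})$ bivariate normally distributed with


$$\begin{bmatrix} x_{1i} \\ x_{2i} \end{bmatrix} \sim N\left[ \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix} \right], \tag{27.21}$$


so that $x_{2i} \mid x_{1i} \sim N[\rho x_{1i}, 1 - \rho^2]$. Also, we set $\beta' = [1 \ 1 \ 1]$, $N = 1{,}000$, and the
proportion of randomly missing data on $x_1$ and $x_2$ to either 10% or 25%. For any $i$,
either $x_1$ or $x_2$, or both, may be missing. We also use two different values of $\rho$, 0.36
and 0.64.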
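
A data set of this kind can be generated as in the sketch below, which also shows the two model-free treatments. Several details are assumptions made only for illustration: the error standard deviation `sigma` (the text does not report its value), independent missingness of $x_1$ and $x_2$ with common probability `p_miss`, and all function names.

```python
import math
import random


def simulate(n=1000, rho=0.64, p_miss=0.10, sigma=1.0, seed=0):
    """Draw (y, x1, x2) from (27.20)-(27.21); None marks a missing regressor."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        # x2 | x1 ~ N[rho * x1, 1 - rho^2]
        x2 = rho * x1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        y = 1.0 + x1 + x2 + rng.gauss(0.0, sigma)      # beta = (1, 1, 1)
        miss1 = rng.random() < p_miss                  # missingness unrelated to the data
        miss2 = rng.random() < p_miss
        rows.append((y, None if miss1 else x1, None if miss2 else x2))
    return rows


def listwise_delete(rows):
    """Keep only observations with both regressors observed."""
    return [r for r in rows if r[1] is not None and r[2] is not None]


def mean_impute(rows):
    """Replace each missing regressor value by the mean of its observed values."""
    means = []
    for j in (1, 2):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [(y, means[0] if a is None else a, means[1] if b is None else b)
            for y, a, b in rows]
```

With 10% missing on each regressor, roughly 81% of observations survive listwise deletion under the independence assumption made here.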

For the Markov chain we use 500 iterations for the burn-in phase. The Markov chain 
calculations are implemented using SAS Proc MI, which uses an uninformative
prior. For demonstration purposes only, the number of imputations is fixed
at 10 but the length of the chain after the burn-in phase varies from 10 to 10,000. Proc
MI combines the results from multiple imputations using Equations (27.15)-(27.18).

Tables 27.2 and 27.3 present results for high $\rho$ and low and high rates of missing
data. There are no dramatic differences among methods. Because the MAR assumption
applies, point estimates from listwise deletion and the full sample remain close,
but as expected the standard errors are larger under listwise deletion. Under mean
imputation the point estimate of $\beta_2$ diverges relatively more, but the observed variation
is well within the bounds of sampling error. It appears that in both cases the Markov
chain attains stationarity rather rapidly, there being very little difference between the
results with 10 and 10,000 iterations. This is probably due to having set the number of
burn-in iterations at 500, which may be higher than needed for this relatively simple
case.


Table 27.2. Missing Data Imputation: Linear Regression Estimates with 10%
Missing Data and High Correlation Using MCMC Algorithm

             No Data    Listwise   Mean       Length of the Markov Chain
             Missing    Deletion   Impute     10         1,000      5,000      10,000
$\beta_0$    0.919      0.913      0.899      0.910      0.911      0.909      0.903
             (0.104)    (0.113)    (0.105)    (0.102)    (0.101)    (0.103)    (0.101)
$\beta_1$    1.097      1.067      1.053      1.196      1.205      1.199      1.199
             (0.138)    (0.151)    (0.141)    (0.148)    (0.155)    (0.144)    (0.147)
$\beta_2$    1.000      1.072      1.112      1.042      1.051      1.041      1.055
             (0.132)    (0.145)    (0.135)    (0.140)    (0.146)    (0.143)    (0.146)
$R^2$        0.240      0.254      0.226


Table 27.3. Missing Data Imputation: Linear Regression Estimates with 25%
Missing Data and High Correlation Using MCMC Algorithm

             No Data    Listwise   Mean       Length of the Markov Chain
             Missing    Deletion   Impute     10         1,000      5,000      10,000
$\beta_0$    0.919      0.863      0.984      0.899      0.898      0.925      0.900
             (0.104)    (0.167)    (0.108)    (0.108)    (0.105)    (0.111)    (0.110)
$\beta_1$    1.097      1.048      1.062      1.028      1.047      1.082      0.987
             (0.138)    (0.167)    (0.150)    (0.152)    (0.166)    (0.161)    (0.155)
$\beta_2$    1.000      1.129      1.156      1.071      1.085      1.024      1.124
             (0.132)    (0.161)    (0.148)    (0.152)    (0.144)    (0.172)    (0.152)
$R^2$        0.240      0.268      0.203

In Table 27.4 the simulation exercise is repeated for the “worst-case” scenario of
low $\rho$ and 25% missing data. The divergence between the point estimates from the
full sample and those from listwise deletion and mean imputation is overall relatively
greater than that for the MCMC cases. However, even in this case no estimate differs
dramatically from its full-sample counterpart. Once again the benefits of running a
long Markov chain are not apparent in this example.


27.8.2. Logit Regression with Missing Data on Regressors 


Table 27.4. Missing Data Imputation: Linear Regression Estimates with 10%
Missing Data and Low Correlation Using MCMC Algorithm

             No Data    Listwise   Mean       Length of the Markov Chain
             Missing    Deletion   Impute     10         1,000      5,000      10,000
$\beta_0$    1.121      1.162      1.142      1.149      1.155      1.154      1.141
             (0.099)    (0.130)    (0.103)    (0.104)    (0.103)    (0.104)    (0.101)
$\beta_1$    1.099      0.930      1.052      1.026      1.020      1.004      1.044
             (0.107)    (0.134)    (0.121)    (0.127)    (0.128)    (0.124)    (0.124)
$\beta_2$    1.102      1.122      1.215      1.130      1.157      1.137      1.151
             (0.107)    (0.134)    (0.124)    (0.128)    (0.129)    (0.129)    (0.119)
$R^2$        0.243      0.235      0.186


Table 27.5. Missing Data Imputation: Logistic Regression Estimates with 10%
Missing Data and High Correlation Using MCMC Algorithm

             No Data    Listwise   Mean       Length of the Markov Chain
             Missing    Deletion   Impute     10         1,000      5,000      10,000
$\beta_0$    -0.447     -0.498     -0.439     -0.527     -0.534     -0.531     -0.539
             (0.070)    (0.078)    (0.070)    (0.073)    (0.073)    (0.072)    (0.073)
$\beta_1$    -0.597     -0.658     -0.602     -0.620     -0.673     -0.681     -0.675
             (0.096)    (0.108)    (0.098)    (0.106)    (0.102)    (0.101)    (0.103)
$\beta_2$    -0.444     -0.474     -0.523     -0.597     -0.540     -0.536     -0.553
             (0.092)    (0.103)    (0.094)    (0.107)    (0.103)    (0.099)    (0.101)


We next consider an example of a nonlinear model with missing data on regressors
using simulated data. In this simulation example we retain the dgp given before but
change the dependent variable into a discrete dichotomous variable. First, reinterpret
the simulation design given for the linear regression example, so $y = y^*$, a latent
variable. Let the dgp be


$$y_i^* = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + u_i, \quad i = 1, 2, \ldots, N. \tag{27.22}$$


Then a dichotomous $y_i$ is generated according to the following rule:


$$y_i = \begin{cases} 1 & \text{if } y_i^* \geq 0, \\ 0 & \text{if } y_i^* < 0. \end{cases} \tag{27.23}$$


We will model the probability that $y_i = 0$ using the logit model, even though the dgp
is that for the probit model. As discussed in Section 14.4.1, the logit model identifies
the parameter vector $\beta/\sigma$, where the variance $\sigma^2 = \pi^2/3$. With all elements of $\beta$ set
equal to one, the logit model will provide estimates of the true parameter value of
approximately $-0.551$. The MCMC estimation is set up as before with a noninformative
prior.
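
The figure of approximately $-0.551$ is the rescaling $-\beta/\sigma$ evaluated at $\beta = 1$ and $\sigma = \pi/\sqrt{3}$, with the negative sign arising because the probability of $y_i = 0$ is modeled. A quick arithmetic check follows; the function name is hypothetical.

```python
import math


def logit_scaled_coefficient(beta=1.0):
    """Return -beta/sigma, where sigma^2 = pi^2/3 is the logistic variance."""
    sigma = math.pi / math.sqrt(3.0)
    return -beta / sigma


if __name__ == "__main__":
    print(round(logit_scaled_coefficient(1.0), 3))
```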

Table 27.5 covers the favorable case with 10% missing data and high correlation
between $x_1$ and $x_2$, and Table 27.6 covers the less favorable case with 25% missing
data and low correlation between $x_1$ and $x_2$.

In the first case, even with no missing data the estimate $\hat{\beta}_2$ is substantially off its
expected value. The MCMC point estimates change somewhat when the length of 
the Markov chain is increased from 10 to 1,000. However, when more simulations
are implemented, there is only slight change in point estimates, a result that we can
interpret as an indication of convergence of the chain to its stationary distribution.

For the second example involving a less favorable simulation design, the results
are as shown in Table 27.6. The main difference is that the divergence between the
expected point estimates and the estimated values is somewhat larger than for the
previous case. However, broadly speaking the performance of the multiple imputation
method in the logistic regression is similar to that in the linear regression.


Table 27.6. Missing Data Imputation: Logistic Regression Estimates with 25% Missing
Data and Low Correlation Using MCMC Algorithm

             No Data    Listwise   Mean       Length of the Markov Chain
             Missing    Deletion   Impute     10         1,000      5,000      10,000
$\beta_0$    -0.447     -0.658     -0.582     -0.605     -0.609     -0.609     -0.599
             (0.070)    (0.097)    (0.070)    (0.074)    (0.074)    (0.073)    (0.076)
$\beta_1$    -0.597     -0.434     -0.470     -0.447     -0.470     -0.471     -0.481
             (0.096)    (0.100)    (0.085)    (0.090)    (0.094)    (0.094)    (0.082)
$\beta_2$    -0.444     -0.593     -0.648     -0.634     -0.615     -0.576     -0.596
             (0.092)    (0.108)    (0.089)    (0.084)    (0.086)    (0.086)    (0.094)


27.9. Practical Considerations 


A major implication of the analysis of this chapter for practice is that analysis of mul- 
tiply, rather than singly, imputed data has theoretical advantages. Moreover, model- 
based approaches are less ad hoc than mechanical approaches such as mean imputation 
or hot deck. In many realistic applications devising an MCMC-type imputation proce- 
dure may pose a significant challenge, however, compared to the relative simplicity of 
the examples given in the last section. 

A distinction may be drawn between multiple imputation in which the end product
is the data and multiple imputation in which the end product consists of estimated
coefficients for inference. Although both procedures may be model based, the second
may involve more complex econometric models. Examples are provided by Brownstone
and Valletta (1996), Stinebrickner (1999), Kennickell (1998), and Davey, Shanahan,
and Schafer (2001).

Even when the primary object is imputation, without extensive modeling the prob- 
lem may be far from simple. For example, in his study of the 1995 Survey of Consumer 
Finances, Kennickell (1998, p. 5) remarks: 


[When] the survey contains a very large number of variables, there is substantial 
missing or partially missing (range) information, the patterns of missing informa- 
tion are highly heterogeneous, the distributions of many of the variables are highly 
skewed, and the data have a complex structure, [then], analysis of the survey in the 
absence of imputation would be a formidable task. Moreover, anyone using the pub- 
lic version of the data set would lack key frame data that turn out to be important 
for understanding the distributions of the missing data. Thus, even on pure efficiency 
grounds, there is a good case for imputing the missing data. 


Despite the complexity of the problem Kennickell was able to use imputation proce- 
dures similar to those discussed in this chapter. 

Stinebrickner (1999), also facing a missing data situation in which listwise deletion
“would leave the econometrician with too little data to estimate the model of interest,”
develops a two-stage simulated likelihood-based procedure for estimating the joint
distribution of the missing data and estimating a duration model for the first teaching
spell.

For relatively simple cases software such as the SAS package Proc MI may be 
used. S-Plus and SOLAS also provide software support. A helpful guide and survey 
of computer software packages is given in Horton and Lipsitz (2001). For additional 
information see the relevant Web sites. 

Most of the analysis of the chapter is based on assuming an ignorable missing data 
mechanism. From an econometric viewpoint this might be a major simplification. For 
example, see Lillard, Smith, and Welch (1986), who critique the Census hot deck 
method for imputing missing wages. How should one proceed if the mechanism is 
nonignorable? In the notation of Section 27.4, a nonignorable missing data mechanism
would imply that parameters $\theta$ and $\psi$ are not distinct. Then one must specify
the missing data mechanism explicitly, as in the case of selection models and models 
of attrition bias (see Chapter 16 and Section 23.5.2). Schafer (1997, p. 28) provides 
some relevant references to the literature. 


27.10. Bibliographic Notes 


Important early references include Little and Rubin (1987) and Rubin (1987). Allison (2002) 
provides a relatively nontechnical but lucid introduction to the missing data problem and lit- 
erature. Rubin (1996) provides a survey with historical perspective. Schafer (1997) provides a 
more complete analysis that covers categorical data, mixed discrete-continuous data, and data
from complex surveys. 


27.2 Meng (2000) provides a historical perspective on the missing data mechanism. 
27.5 Little (1988, 1992) provides a good review of the literature on linear regression with miss- 
ing regressors, covering both non-model-based and model-based approaches. 


Exercises 


27-1 Consider any regression model, linear or nonlinear, with dependent variable $y$
and exogenous variables $x$, and iid errors $\varepsilon$. Show that if the probability of missing
data on $x$ is independent of $y$, then the regression based on listwise deletion
will provide a consistent estimate of the conditional mean function. [Hint: Show
that the conditional distribution of $y$ given $x$ is not affected by missing
observations.]


27-2 (Adapted from Gouriéroux and Monfort, 1981.) Consider the regression model
$y = \beta_1 x + Z\beta_2 + u$, where $y$ is an $N \times 1$ vector, $Z$ is an $N \times K$ matrix, and $x$
is an $N \times 1$ vector of a scalar regressor, some of whose elements are missing.
Assume that observations are missing at random and $E[u \mid x, Z] = 0$ and
$E[uu' \mid x, Z] = \sigma_u^2 I_N$. Both $y$ and $Z$ are fully observed. The following approach
is proposed to deal with the missing data. Assume a linear regression model
relating $x$ to $Z$, $x = Z\gamma + e$, where $E[e \mid Z] = 0$ and $E[ee' \mid Z] = \sigma_e^2 I_N$. Then let
$\hat{\gamma} = [Z_c'Z_c]^{-1}Z_c'x_c$, where the subscript $c$ refers to “complete data.” Impute
values $\hat{x}_m = Z_m[Z_c'Z_c]^{-1}Z_c'x_c$, where $x_m$ refers to the missing observations and
$Z_m$ to the corresponding values of $Z$. The original regression is then reestimated
using the full set of $N$ observations after replacing the missing values of $x$ by
imputed values.

(a) Explain why the OLS regression estimator based on complete and imputed 
observations might be biased in finite samples. 

(b) What additional assumptions are required to prove that the OLS estimator 
based on complete plus imputed values is consistent? 

(c) Is the OLS estimator efficient? 


27-3 Consider the point that when estimation of a model is undertaken after data
imputation the precision of the estimates is likely to be overstated if no adjustment
is made for the imputation step. In other words, imputed data may be regarded
as generated variables and hence subject to the problem of the sequential two-step
estimator discussed in Section 6.6. Explain whether an adjustment related
to imputation of missing data is necessary asymptotically.


Asymptotic Theory


A.1. Introduction 


In this appendix we consider the behavior of a sequence of random variables $b_N$ as
$N \to \infty$.

In applications the index $N$ is the sample size and the sequence $b_N$ is an estimator,
such as $\hat{\beta}$ or $\hat{\theta}$, or a component of an estimator, such as $N^{-1}\sum_i x_i^2$ or $N^{-1}\sum_i x_i u_i$ in
the case of OLS with one regressor and no intercept, or a test statistic.

For estimation theory it is sufficient to focus on two aspects of the behavior of the
sequence $b_N$ as $N \to \infty$. First, we consider convergence in probability of $b_N$ to a
limit $b$, a constant or random variable that is very close to $b_N$ in a probabilistic sense
defined in the following. Second, if the limit $b$ is a random variable, which may require
a rescaling of the original sequence, we consider the limit distribution.

Estimators are usually functions of averages or sums. Then it is easiest to derive 
limiting results by invoking results on the behavior of averages, notably laws of large 
numbers and central limit theorems. The notation used is to consider an average 
$\bar{X}_N = N^{-1}\sum_i X_i$, where $X_i$ here is generic notation for a random variable being
averaged and should not be confused with the use of $x_i$ to denote the regressor vector. For
example, for OLS with one regressor and no intercept we will apply a law of large numbers
to the average of $X_i = x_i^2$ and a central limit theorem to the average of $X_i = x_i u_i$.

Table A.1 summarizes the definitions and theorems presented in the remainder of 
this appendix. These are stated without proof but with some discussion. The focus is 
on results used to obtain asymptotically normal estimators, the usual case when cross- 
section data are used. Additional results are needed for application to nonparametric 
estimation, to parametric estimation when the support of the data depends on parame- 
ters, and to time series estimation when data have unit roots. 

The first key concept, convergence in probability, is presented in Section A.2. This is
established using laws of large numbers given in Section A.3. The other key concept,
convergence in distribution, is presented in Section A.4. Convergence to the normal
distribution is established using central limit theorems given in Section A.5. Further
results and common terminology for limit multivariate normal distributions are given
in Section A.6. Stochastic order of magnitude, a convenient notation commonly used
in asymptotic analysis, is presented in Section A.7. Section A.8 presents some useful
properties of expectations.


Table A.1. Asymptotic Theory: Definitions and Theorems

Definition   Theorem   Name                             Equation
A.1                    Convergence in Probability       (A.1)
A.2                    Consistency                      (A.2)
             A.3       Slutsky                          (A.3)
A.4                    Mean-Square Convergence          (A.4)
             A.5       Chebychev’s Inequality           (A.5)
A.6                    Almost Sure Convergence          (A.6)
A.7                    Law of Large Numbers             (A.7)
             A.8       Kolmogorov LLN
             A.9       Markov LLN
A.10                   Convergence in Distribution      (A.9)
             A.11      Continuous Mapping               (A.10)
             A.12      Transformation                   (A.11)
A.13                   Central Limit Theorem            (A.13)
             A.14      Lindeberg-Levy CLT
             A.15      Liapounov CLT
             A.16      Cramer-Wold Device
             A.17      Limit Normal Product Rule        (A.15)
A.18                   Asymptotic Distribution          (A.17)
A.19                   Asymptotic Variance              (A.18)
A.20                   Estimated Asymptotic Variance    (A.19)
A.21                   Asymptotic Efficiency
A.22                   Stochastic Order of Magnitude


A.2. Convergence in Probability 


Because of the intrinsic randomness of a sample we can never be certain that a sequence
$b_N$, such as an estimator $\hat{\theta}$ (often denoted $\hat{\theta}_N$ to make clear that it is a sequence),
will be within a given small distance of its limit, even if the sample is infinitely
large. However, we can be almost certain. Different ways of expressing this
near certainty correspond to different types of convergence of a sequence of random
variables to a limit. The one most used in econometrics is convergence in probability.


A.2.1. Convergence in Probability 


Recall that a sequence of nonstochastic real numbers $\{a_N\}$ converges to $a$ if, for any
$\varepsilon > 0$, there exists $N^* = N^*(\varepsilon)$ such that, for all $N > N^*$,


$$|a_N - a| < \varepsilon.$$

For example, if $a_N = 2 + 3/N$, then the limit is $a = 2$ since $|a_N - a| = |2 + 3/N - 2| =
|3/N| < \varepsilon$ for all $N > N^* = 3/\varepsilon$.

When more generally we have a sequence of random variables we cannot be certain
of being within $\varepsilon$ of the limit, even for large $N$, because of intrinsic randomness.
Instead, we require that the probability of being within $\varepsilon$ is arbitrarily close to one.
Thus we require


$$\lim_{N \to \infty} \Pr[|b_N - b| < \varepsilon] = 1,$$

for any $\varepsilon > 0$. A formal definition is the following:


Definition A.1 (Convergence in Probability): A sequence of random variables
$\{b_N\}$ converges in probability to $b$ if, for any $\varepsilon > 0$ and $\delta > 0$, there exists
$N^* = N^*(\varepsilon, \delta)$ such that, for all $N > N^*$,


$$\Pr[|b_N - b| < \varepsilon] > 1 - \delta. \tag{A.1}$$


We write $\operatorname{plim} b_N = b$, where plim is shorthand for probability limit, or $b_N \xrightarrow{p} b$.

Note that b may be a constant or a random variable. Convergence in probability 
includes as a special case the usual definition of convergence for a sequence of real 
variables. 

Definition A.1 is for a sequence of scalar random variables. The extension to vector
random variables, such as a parameter vector estimator, is straightforward. We can
either apply the theory for each element of $b_N$, or replace $|b_N - b|$ by the scalar
$(b_N - b)'(b_N - b) = (b_{1N} - b_1)^2 + \cdots + (b_{KN} - b_K)^2$ or its square root $\|b_N - b\|$.

When the sequence $\{b_N\}$ is a sequence of parameter estimates $\hat{\theta}$, we have the following
large sample analogue of unbiasedness.


Definition A.2 (Consistency): An estimator $\hat{\theta}$ is consistent for $\theta_0$ if

$$\operatorname{plim} \hat{\theta} = \theta_0. \tag{A.2}$$


The subscript 0 on $\theta$ is explained in Section 5.2.3. Note that unbiasedness need
not imply consistency. Unbiasedness states only that the expected value of $\hat{\theta}$ is $\theta_0$,
and it permits variability around $\theta_0$ that need not disappear as the sample size goes to
infinity. Also, a consistent estimator need not be unbiased. For example, adding 1/N 
to an unbiased and consistent estimator produces a new estimator that is biased but 
still consistent. 
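
The last point can be illustrated by simulation. The sketch below is only an illustration under assumed settings (standard normal data with true mean `mu`, the hypothetical function name `prob_far`, and arbitrary choices of $\varepsilon$ and the number of replications): it estimates the probability that the biased estimator, the sample mean plus $1/N$, lies more than $\varepsilon$ from the true value, and this probability shrinks as $N$ grows.

```python
import random


def prob_far(n, eps=0.1, reps=500, mu=0.0, seed=0):
    """Monte Carlo estimate of Pr[|estimator - mu| > eps] for sample size n.

    The estimator is the sample mean plus 1/n, so its bias is exactly 1/n.
    """
    rng = random.Random(seed)
    far = 0
    for _ in range(reps):
        xbar = sum(rng.gauss(mu, 1.0) for _ in range(n)) / n
        estimator = xbar + 1.0 / n        # biased in finite samples
        if abs(estimator - mu) > eps:
            far += 1
    return far / reps


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, prob_far(n))
```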

Although the sequence of vector random variables $\{b_N\}$ may converge to a random
variable $b$, in many econometric applications $\{b_N\}$ converges to a constant. For example,
we hope that an estimator of a parameter will converge in probability to the
parameter itself. One should be aware that some of the results that follow apply only
if the limit value $b$ is a constant.


Theorem A.3 (Slutsky’s Theorem): Let $b_N$ be a finite-dimensional vector of
random variables, and $g(\cdot)$ be a real-valued function continuous at a constant
vector point $b$. Then

$$b_N \xrightarrow{p} b \;\Rightarrow\; g(b_N) \xrightarrow{p} g(b). \tag{A.3}$$


Proof is given in Amemiya (1985, p. 79). Ruud (2000) presents a related result (see
also Rao, 1973, p. 124) that lets the limit $b$ be a random variable, at the expense of
restricting $g(\cdot)$ to be continuous everywhere. Note that some authors instead refer to
Theorem A.12 below as Slutsky’s Theorem.

Theorem A.3 is one of the major reasons for the prevalence of asymptotic results
versus finite-sample results in econometrics. It states a very convenient property
that does not hold for expectations. For example, $\operatorname{plim}(b_{1N}, b_{2N}) = (b_1, b_2)$ implies
$\operatorname{plim}(b_{1N} b_{2N}) = b_1 b_2$, whereas $E[b_{1N} b_{2N}]$ generally differs from $E[b_{1N}]E[b_{2N}]$.


A.2.2. Alternative Modes of Convergence 


It is often easier to establish alternative modes of convergence, which in turn imply 
convergence in probability. 

These alternative modes are given for completeness. Laws of large numbers, given 
in the next section, are used much more often. 


Definition A.4 (Mean-Square Convergence): A sequence of random variables
$\{b_N\}$ is said to converge in mean square to a random variable $b$ if


$$\lim_{N \to \infty} E[(b_N - b)^2] = 0. \tag{A.4}$$


We write $b_N \xrightarrow{m} b$. Convergence in mean square is useful because $b_N \xrightarrow{m} b$ implies
$b_N \xrightarrow{p} b$ (see Rao, 1973, p. 110) and is often easy to prove. This does require existence
of the variance of $b_N$, however. If $E[b_N] = b$, then we need to show that the variance
of $b_N$ goes to zero as $N \to \infty$. If $b_N$ is instead biased for $b$ then we require that the
sum of the variance and bias squared goes to zero.

Another result that can be used to show convergence in probability is Chebychev’s
inequality.


Theorem A.5 (Chebychev’s Inequality): For any random variable $Z$ with mean
$\mu$ and variance $\sigma^2$,


$$\Pr[(Z - \mu)^2 > k] \leq \sigma^2/k, \quad \text{for any } k > 0. \tag{A.5}$$


For a proof see Rao (1973, p. 95). The generalized Chebychev’s inequality replaces
$(Z - \mu)^2$ in Theorem A.5 by any nonnegative function $g(Z)$ and shows that
$\Pr[g(Z) > k] \leq E[g(Z)]/k$, for any $k > 0$. See Amemiya (1985, p. 87).
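
A small numerical check may help fix ideas. The sketch below rests on assumptions chosen only for illustration: $Z$ is drawn from an exponential distribution with mean 1 and variance 1, and the function names are hypothetical. It compares the simulated tail probability with the bound $\sigma^2/k$, and then applies the bound to a sample mean, whose variance is $\sigma^2/N$, so that $\Pr[|\bar{X}_N - \mu| > \varepsilon] \leq \sigma^2/(N\varepsilon^2)$, which goes to zero as $N$ grows.

```python
import random


def chebychev_bound(variance, k):
    """Upper bound variance/k on Pr[(Z - mu)^2 > k]."""
    return variance / k


def empirical_tail(k, reps=20000, seed=0):
    """Simulated Pr[(Z - 1)^2 > k] for Z exponential with mean 1 and variance 1."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(reps) if (rng.expovariate(1.0) - 1.0) ** 2 > k)
    return hits / reps


def sample_mean_bound(sigma2, n, eps):
    """Bound on Pr[|Xbar_N - mu| > eps], using V[Xbar_N] = sigma2 / n."""
    return chebychev_bound(sigma2 / n, eps ** 2)


if __name__ == "__main__":
    for k in (0.5, 1.0, 2.0, 4.0):
        print(k, empirical_tail(k), chebychev_bound(1.0, k))
    for n in (100, 1000, 10000):
        print(n, sample_mean_bound(1.0, n, 0.1))
```

The bound is typically far from tight, which is one reason a law of large numbers is usually the more convenient tool.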

Theorem A.5 can be used to verify convergence in probability by replacing $Z$ with
$b_N$. The theorem requires the mean and variance of $b_N$, which are easily obtained for
estimators that involve an average of independent random variables. However, in such
cases we can often take an even easier route and directly apply a law of large numbers
to the average to obtain the probability limit.

A conceptually more difficult type of convergence is almost sure convergence.


Definition A.6 (Almost Sure Convergence): A sequence of random variables
$\{b_N\}$ is said to converge almost surely to $b$ if


$$\Pr\left[\lim_{N \to \infty} b_N = b\right] = 1. \tag{A.6}$$


This is denoted $b_N \xrightarrow{as} b$. Almost sure convergence implies convergence in probability
(see Rao, 1973, p. 111). Convergence in probability allows more erratic behavior
in $b_N$ than does almost sure convergence.

Almost sure convergence is also called strong consistency for b, to distinguish 
it from convergence in probability, which is called weak consistency for b. Conver- 
gence in probability is easier to understand and is sufficient for most econometric 
applications. 


A.3. Laws of Large Numbers 


Laws of large numbers are theorems for convergence in probability (or almost surely)
in the special case where the sequence $\{b_N\}$ is a sample average, that is, $b_N = \bar{X}_N$,
where


$$\bar{X}_N = \frac{1}{N} \sum_{i=1}^{N} X_i. \tag{A.7}$$


Note that $X_i$ here is general notation for a random variable, and in the regression
context it does not necessarily denote the regressor variables.

A law of large numbers provides a much easier way to establish the probability
limit of a sequence $\{b_N\}$ than the alternatives of brute-force use of the $(\varepsilon, \delta)$ definition
given in (A.1) or use of alternative modes of convergence that imply convergence in
probability.


Definition A.7 (Law of Large Numbers): A weak law of large numbers
(LLN) specifies conditions on the individual terms $X_i$ in $\bar{X}_N$ under which


$$(\bar{X}_N - E[\bar{X}_N]) \xrightarrow{p} 0. \tag{A.8}$$


For a strong law of large numbers the convergence is instead almost surely.

It can be helpful to think of a LLN as establishing that $\bar{X}_N$ goes to its expected
value, even though strictly speaking it implies the weaker condition that $\bar{X}_N$ goes to
the limit of its expected value, since (A.8) implies that


$$\operatorname{plim} \bar{X}_N = \lim E[\bar{X}_N].$$


If the $X_i$ have common mean $\mu$, then this simplifies to $\operatorname{plim} \bar{X}_N = \mu$.
Two leading examples of laws of large numbers are the following:


Theorem A.8 (Kolmogorov LLN): Let $\{X_i\}$ be iid (independent and identically
distributed). If and only if $E[X_i] = \mu$ exists and $E[|X_i|] < \infty$, then
$(\bar{X}_N - E[\bar{X}_N]) \xrightarrow{as} 0$.

Theorem A.9 (Markov LLN): Let $\{X_i\}$ be inid (independent but not identically
distributed) with $E[X_i] = \mu_i$ and $V[X_i] = \sigma_i^2$. If
$\sum_{i=1}^{\infty} (E[|X_i - \mu_i|^{1+\delta}]/i^{1+\delta}) < \infty$, for some $\delta > 0$, then $(\bar{X}_N - E[\bar{X}_N]) \xrightarrow{as} 0$.


See White (2001a, p. 32 and p. 35) for statements of these theorems and Rao (1973,
pp. 114-116) for proofs. Both laws give the stronger result of almost sure convergence,
which implies the desired convergence in probability. Rao (1973) calls Theorem A.8
Kolmogorov LLN2 and presents Theorem A.9 for the special case $\delta = 1$, which he
calls Kolmogorov LLN1.

The Kolmogorov LLN allows the variance of $X_i$ to not even exist, at the expense
of requiring an identical distribution. It simplifies to $\bar{X}_N \xrightarrow{as} \mu$, where $\mu = E[X]$. A
weak version of this law, sufficient for most econometrics applications, is Khinchine’s
Theorem, which states that for $\{X_i\}$ iid the existence of $E[X]$ implies convergence in
probability.

The Markov LLN no longer requires an identical distribution, but it does require
existence of an absolute moment beyond the first. An obvious choice of $\delta$ is $\delta = 1$.
Then the variance is needed and the side condition is that $\sum_{i=1}^{\infty} (\sigma_i^2/i^2) < \infty$. The
variance can vary and even grow with $i$, provided it does not grow so fast that $(\sigma_i^2/i^2)$
has infinite sum. The side condition is satisfied if $\sigma_i^2 = \sigma^2$, since $\sum_{i=1}^{\infty} 1/i^2$ converges,
but is not satisfied if $\sigma_i^2 = i\sigma^2$, since $\sum_{i=1}^{\infty} 1/i$ diverges.

In most microeconometrics applications, including regression with stratified sam- 
pling or with fixed regressors, the more complicated Markov LLN is needed. 

Laws of large numbers are appealing because they require assumptions on the
individual components $X_i$, rather than the sequence of averages $\bar{X}_N$. They are the most
common way econometricians prove convergence in probability, since most estimators
and test statistics are functions of averages of the data and unobserved random
variables.
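
A small simulation can make the law of large numbers concrete. The sketch below is only illustrative: the exponential data-generating process, the seed, the sample sizes, and the helper name `sample_average_path` are all assumptions made for the example, not part of the text. It tracks the running sample average $\bar{X}_N$ of iid draws with mean $\mu$ and reports how far it is from $\mu$ as $N$ grows.

```python
import numpy as np


def sample_average_path(draws):
    """Return the running sample averages X_bar_1, ..., X_bar_N of a 1-D array."""
    n = np.arange(1, draws.size + 1)
    return np.cumsum(draws) / n


rng = np.random.default_rng(12345)           # arbitrary seed (assumption)
mu = 2.0                                     # population mean of the illustrative dgp
x = rng.exponential(scale=mu, size=200_000)  # iid draws with E[X_i] = mu
xbar = sample_average_path(x)

# The distance |X_bar_N - mu| should tend to shrink as N increases.
for n in (100, 10_000, 200_000):
    print(f"N = {n:>7d}   |X_bar_N - mu| = {abs(xbar[n - 1] - mu):.5f}")
```

A single simulated path is of course not a proof; it only shows the kind of behavior that the Kolmogorov LLN guarantees for iid data with a finite mean.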


A.4. Convergence in Distribution 


Given consistency, the estimator $\hat{\theta}$ has a degenerate distribution that collapses on
$\theta_0$ as $N \to \infty$. We need to magnify or rescale $\hat{\theta}$ to obtain a random variable
that has nondegenerate distribution as $N \to \infty$. Often the appropriate scale factor is
$\sqrt{N}$, in which case we consider the behavior of the sequence of random variables
$b_N = \sqrt{N}(\hat{\theta} - \theta_0)$.

In general, the $N$th random variable in the sequence $b_N$ has an extremely complicated
cumulative distribution function (cdf) $F_N$. Like any other function, $F_N$ may
have a limit function, where convergence is in the usual mathematical sense.


Definition A.10 (Convergence in Distribution): A sequence of random variables
$\{b_N\}$ is said to converge in distribution to a random variable $b$ if

$\lim_{N \to \infty} F_N = F,$ (A.9)

at every continuity point of $F$, where $F_N$ is the distribution of $b_N$, $F$ is the
distribution of $b$, and convergence is in the usual mathematical sense.


We write $b_N \xrightarrow{d} b$, and we call $F$ the limit distribution of $\{b_N\}$.

Convergence in probability implies convergence in distribution; that is,
$b_N \xrightarrow{p} b$ implies $b_N \xrightarrow{d} b$ (see Rao, 1973, p. 122).

In general, the converse is not true. For example, let $b_N = X_N$, the $N$th realization
of $X \sim N[\mu, \sigma^2]$. Then $b_N \xrightarrow{d} b \sim N[\mu, \sigma^2]$, but clearly $(b_N - b)$
has variance that does not disappear as $N \to \infty$, so $b_N$ does not converge in probability to $b$.

In the special case where $b$ is a constant, however, $b_N \xrightarrow{d} b$ implies
$b_N \xrightarrow{p} b$ (see Rao, 1973, p. 120). In this case the limit distribution is degenerate,
with all its mass at $b$.

To extend limit distribution to vector random variables simply define $F_N$ and $F$
to be the respective cdfs of vectors $b_N$ and $b$.


Theorem A.11 (Continuous Mapping Theorem): Let $b_N$ be a finite-dimensional
vector of random variables, and let $g(\cdot)$ be a continuous real-valued
function. Then

$b_N \xrightarrow{d} b \;\Rightarrow\; g(b_N) \xrightarrow{d} g(b).$ (A.10)


For proof see Rao (1973, p. 124). Theorem A.11 is the convergence in distribution 
analogue of Theorem A.3 for convergence in probability. 

The following theorem considers the effect of transforming a sequence with limit 
distribution by addition of, or multiplication by, or division by a sequence that con- 
verges in probability to a constant. 


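Theorem A.12 (Transformation Theorem): If $a_N \xrightarrow{d} a$ and $b_N \xrightarrow{p} b$,
where $a$ is a random variable and $b$ is a constant, then

(i) $a_N + b_N \xrightarrow{d} a + b$,
(ii) $a_N b_N \xrightarrow{d} ab$, and (A.11)
(iii) $a_N / b_N \xrightarrow{d} a/b$, provided $\Pr[b = 0] = 0$.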


For proof see Rao (1973, p. 122). Theorem A.12 is also referred to as Cramer’s The- 
orem. It is also called Slutsky’s Theorem, the name we have applied to Theorem A.3. 

Theorem A.12 is exceptionally useful because it permits one to separately find the
limit distribution of $a_N$ and the probability limit of $b_N$, rather than having to consider
the joint behavior of $a_N$ and $b_N$. Result (ii) is especially useful and is sometimes called
the Product Rule.
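
The Transformation Theorem can be illustrated with a studentized sample mean. In the sketch below, which rests on assumptions chosen purely for illustration (exponential data with mean and standard deviation one, $N = 500$, 5,000 replications, an arbitrary seed), $a_N = \sqrt{N}(\bar{X}_N - \mu)/\sigma$ has a standard normal limit distribution, $b_N = s/\sigma$ converges in probability to 1, and part (iii) then suggests that $a_N / b_N$ has the same standard normal limit.

```python
import numpy as np

rng = np.random.default_rng(2024)   # arbitrary seed (assumption)
N, R = 500, 5_000                   # sample size and number of replications
mu, sigma = 1.0, 1.0                # exponential(1) has mean 1 and sd 1

x = rng.exponential(scale=mu, size=(R, N))
a_N = np.sqrt(N) * (x.mean(axis=1) - mu) / sigma   # limit distribution N[0, 1]
b_N = x.std(axis=1, ddof=1) / sigma                # probability limit 1
t_N = a_N / b_N                                    # part (iii): same limit as a_N

# Share of replications with |t_N| < 1.96; close to 0.95 if the
# standard normal approximation is adequate at this sample size.
coverage = np.mean(np.abs(t_N) < 1.96)
print("mean of b_N:", b_N.mean())
print("coverage of |t_N| < 1.96:", coverage)
```

The practical point is the one made in the text: the limit distribution of $a_N$ and the probability limit of $b_N$ are found separately, with no need to study their joint behavior.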


A.5. Central Limit Theorems 


Central limit theorems are theorems on convergence in distribution when the sequence
$\{b_N\}$ is a sample average. A central limit theorem provides a simpler way to obtain
the limit distribution of a sequence $\{b_N\}$ than the alternatives such as brute-force use
of (A.9).

From a law of large numbers, the sample average has a degenerate distribution as it
converges to a constant, $\lim \mathrm{E}[\bar{X}_N]$. So we scale $(\bar{X}_N - \mathrm{E}[\bar{X}_N])$ by its
standard deviation to construct a random variable with unit variance that may converge to a
nondegenerate distribution.


Definition A.13 (Central Limit Theorem): Let

$Z_N = \dfrac{\bar{X}_N - \mathrm{E}[\bar{X}_N]}{\sqrt{\mathrm{V}[\bar{X}_N]}},$ (A.12)

where $\bar{X}_N$ is a sample average. A central limit theorem (CLT) specifies the
conditions on the individual terms $X_i$ in $\bar{X}_N$ under which

$Z_N \xrightarrow{d} N[0, 1],$ (A.13)

that is, under which $Z_N$ converges in distribution to a standard normal random
variable.


By construction $Z_N$ has mean 0 and variance 1, so what needs to be proved is the
normality. Formal proofs of a CLT do this by obtaining the characteristic function, a
generalization of the moment-generating function, of $Z_N$ and showing that it converges
as $N \to \infty$ to the characteristic function of the standard normal distribution.

Note that if $\bar{X}_N$ satisfies a central limit theorem, then so too does $h(N)\bar{X}_N$ for
functions $h(\cdot)$ such as $h(N) = \sqrt{N}$, since

$Z_N = \dfrac{h(N)\bar{X}_N - \mathrm{E}[h(N)\bar{X}_N]}{\sqrt{\mathrm{V}[h(N)\bar{X}_N]}}.$

In many applications it is convenient to apply the central limit theorem to the
normalization $\sqrt{N}\bar{X}_N = N^{-1/2} \sum_{i=1}^{N} X_i$, since $\mathrm{V}[\sqrt{N}\bar{X}_N]$ is finite.
Examples of central limit theorems include the following:


Theorem A.14 (Lindeberg-Levy CLT): Let $\{X_i\}$ be iid with $\mathrm{E}[X_i] = \mu$ and
$\mathrm{V}[X_i] = \sigma^2$. Then $Z_N \xrightarrow{d} N[0, 1]$.

For a proof, see Rao (1973, p. 127).
This is the CLT that usually appears in introductory statistics texts and is useful in
the iid case. Since $X_i$ is iid $[\mu, \sigma^2]$, $Z_N$ simplifies to the more familiar

$Z_N = \dfrac{\sqrt{N}(\bar{X}_N - \mu)}{\sigma}.$

Note that in the iid case only the existence of $\mu$ is required to ensure that
$\bar{X}_N \xrightarrow{p} \mu$, whereas to obtain a limiting normal distribution requires the
additional assumption that $\sigma^2$ exists.
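
The following sketch illustrates the Lindeberg-Levy CLT by simulation. Everything specific in it is an assumption made for the example: the skewed chi-square(1) data-generating process (mean 1, variance 2), the sample sizes, the number of replications, the seed, and the helper name `standardized_mean`. For each $N$ it computes $Z_N = \sqrt{N}(\bar{X}_N - \mu)/\sigma$ across replications and compares its mean, variance, and the share of draws below 1.645 with the standard normal values 0, 1, and roughly 0.95.

```python
import numpy as np


def standardized_mean(x, mu, sigma):
    """Z_N = sqrt(N) (X_bar_N - mu) / sigma, computed for each row of x."""
    n = x.shape[1]
    return np.sqrt(n) * (x.mean(axis=1) - mu) / sigma


rng = np.random.default_rng(7)   # arbitrary seed (assumption)
R = 10_000                       # number of simulated samples
mu, sigma = 1.0, np.sqrt(2.0)    # chi-square(1): mean 1, variance 2

for N in (5, 50, 500):
    x = rng.chisquare(df=1, size=(R, N))
    z = standardized_mean(x, mu, sigma)
    share = np.mean(z <= 1.645)  # near 0.95 under normality
    print(f"N = {N:>3d}  mean = {z.mean():6.3f}  var = {z.var():5.3f}  "
          f"Pr[Z_N <= 1.645] = {share:5.3f}")
```

The mean and variance are close to 0 and 1 for every $N$ by construction; what improves as $N$ grows is the shape of the distribution, which is why the tail probability is the more informative column.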

In applications such as OLS with fixed regressors the iid assumption is inappropriate.
One can apply a CLT for $\{X_i\}$ inid, though additional assumptions need to be
made.


Theorem A.15 (Liapounov CLT): Let $\{X_i\}$ be independent with $\mathrm{E}[X_i] = \mu_i$
and $\mathrm{V}[X_i] = \sigma_i^2$. If
$\lim \left( \sum_{i=1}^{N} \mathrm{E}[|X_i - \mu_i|^{2+\delta}] \right) / \left( \sum_{i=1}^{N} \sigma_i^2 \right)^{(2+\delta)/2} = 0$,
for some choice of $\delta > 0$, then $Z_N \xrightarrow{d} N[0, 1]$.


This variant of the Liapounov CLT is proved in White (2001a, p. 119). Rao (1973,
p. 128) presents the special case $\delta = 1$.

The main additional assumption in the Liapounov CLT is the existence of an absolute
moment of order higher than two. Note also the additional assumptions compared
to the corresponding LLN for iid data. For $X_i$ inid

$Z_N = \dfrac{\sum_{i=1}^{N} X_i - \sum_{i=1}^{N} \mu_i}{\sqrt{\sum_{i=1}^{N} \sigma_i^2}}.$

Theorems A.14 and A.15 are special cases of the more general Lindeberg—Feller 
CLT (see Rao, 1973, p. 128). The Lindeberg—Feller CLT has a side condition that can 
be difficult to verify. 

In most microeconometrics applications, including regression with stratified sampling
or with fixed regressors, the more complicated Liapounov CLT is used.


A.6. Multivariate Normal Limit Distributions 


In this section we focus on the typical microeconometrics case of estimators with 
multivariate normal limit distributions. 


A.6.1. Multivariate Normal Limit Distributions 


The central limit theorems presented were for sequences of scalar random variables. 
They can be extended to sequences of vector random variables using the following 
result. 


Theorem A.16 (Cramer-Wold Device): Let $\{b_N\}$ be a sequence of random
$k \times 1$ vectors. If $\lambda' b_N$ converges to a normal random variable for every $k \times 1$
constant nonzero vector $\lambda$, then $b_N$ converges to a multivariate normal random
variable.


Rao (1973, p. 128) gives a more general result that is not restricted to normal dis- 
tributions. 

The advantage of this result is that, if $b_N$ is a vector of averages, then
$\lambda' b_N = \lambda_1 b_{1N} + \cdots + \lambda_k b_{kN}$ will be a scalar average and we can apply
a scalar central limit theorem given in the previous section. This will yield

$\dfrac{\lambda' b_N - \lambda' \mu_N}{\sqrt{\lambda' V_N \lambda}} \xrightarrow{d} N[0, 1],$

where $\mu_N = \mathrm{E}[b_N]$ and $V_N = \mathrm{V}[b_N]$, in which case we conclude that

$V_N^{-1/2}(b_N - \mu_N) \xrightarrow{d} N[0, I].$ (A.14)

This result is explained further in Subsection A.6.3.


A.6.2. Linear Transformation


Microeconometric estimators can often be expressed as
$\sqrt{N}(\hat{\theta} - \theta_0) = H_N a_N$, where plim $H_N$ exists and $a_N$ has a limit normal
distribution. The distribution of this product, or linear transformation of $a_N$, can be
obtained directly from part (ii) of Theorem A.12 (Transformation Theorem). We restate it
in a form that arises for many estimators.


Theorem A.17 (Product Limit Normal Rule): If a vector $a_N \xrightarrow{d} N[\mu, A]$ and
a matrix $H_N \xrightarrow{p} H$, where $H$ is positive definite, then

$H_N a_N \xrightarrow{d} N[H\mu, HAH'].$ (A.15)


Theorem A.17 can be directly applied to an estimator. For example, the OLS estimator

$\sqrt{N}(\hat{\beta} - \beta_0) = \left( \dfrac{1}{N} X'X \right)^{-1} \dfrac{1}{\sqrt{N}} X'u$

is treated as the product of $H_N = (N^{-1}X'X)^{-1}$ and $a_N = N^{-1/2}X'u$ and we find the
plim of $H_N$ and the limit distribution of $a_N$.

Theorem A.17 can also be used to justify replacement of a limit distribution variance
matrix by a consistent estimate without changing the limit distribution. If we have
shown that

$\sqrt{N}(\hat{\theta} - \theta_0) \xrightarrow{d} N[0, B],$

then it follows by Theorem A.17 that

$B_N^{-1/2} \times \sqrt{N}(\hat{\theta} - \theta_0) \xrightarrow{d} N[0, I]$

for any $B_N$ that is a consistent estimate for $B$ and is positive definite.


A.6.3. Limit Variance Matrix 


A formal multivariate CLT yields a notationally cumbersome result such as (A.14).
Premultiplying by $V_N^{1/2}$ and applying Theorem A.17, we can reexpress this in the simpler
form

$b_N - \mu_N \xrightarrow{d} N[0, V],$

where $V = \operatorname{plim} V_N$ and we assume $b_N$ and $V_N$ are appropriately scaled so that $V$
exists and is positive definite.

Different authors express the limit variance matrix V in different ways. 

A general definition is simply 


$V = \operatorname{plim} V_N.$


This is the most common way that results are presented and is the form used in this
text. In the fixed regressors case it simplifies to $V = \lim V_N$.

In microeconometrics estimation examples the matrix $V_N$ is often a matrix average,
say

$V_N = \dfrac{1}{N} \sum_{i=1}^{N} S_i,$

where $S_i$ is a square matrix that is a function of parameters and data for the $i$th
observation. Given independence over $i$ a law of large numbers can usually be applied so
that $V_N - \mathrm{E}[V_N] \xrightarrow{p} 0$. Then

$V = \lim \mathrm{E}[V_N] = \lim \dfrac{1}{N} \sum_{i=1}^{N} \mathrm{E}[S_i].$

This is the type of expression used by Amemiya (1985).
If the $S_i$ are iid then $\mathrm{E}[S_i] = \mathrm{E}[S]$ is the same for all observations. So simple
random sampling leads to the simpler expression

$V = \mathrm{E}[S],$


a form used for example by Newey and McFadden (1994) and Wooldridge (2002). 

As an example, consider the OLS estimator with homoskedastic error, so that
$\sqrt{N}(\hat{\beta} - \beta_0) \xrightarrow{d} N[0, \sigma^2 M_{xx}^{-1}]$. Then
$M_{xx} = \operatorname{plim} N^{-1} \sum_i x_i x_i'$ can be re-expressed
as $M_{xx} = \lim N^{-1} \sum_i \mathrm{E}[x_i x_i']$ if a law of large numbers applies, and as
$M_{xx} = \mathrm{E}[xx']$ under simple random sampling.

More complicated forms of $V$ arise, such as the sandwich form $ABA'$. The preceding
discussion is then applied to each component. For example, $B = \operatorname{plim} B_N$ may be
expressed as $B = \lim \mathrm{E}[B_N]$ or as $B = \mathrm{E}[S]$ under random sampling if
$B_N = N^{-1} \sum_i S_i$.


A.6.4. Asymptotic Distribution and Variance 


To obtain the limit distribution of an estimator we work with the sequence
$b_N = \sqrt{N}(\hat{\theta} - \theta_0)$ for theoretical reasons to ensure a nonzero variance of
$b_N$ as $N \to \infty$. Then the limit distribution of $b_N$ is a normal distribution, and many
authors say that $b_N$ is asymptotically normal and call the limit variance matrix the
asymptotic variance of $b_N$.

It can be convenient to reexpress results in terms of the distribution and variance
matrix of $\hat{\theta}$ itself.


Definition A.18 (Asymptotic Distribution of $\hat{\theta}$): If

$\sqrt{N}(\hat{\theta} - \theta_0) \xrightarrow{d} N[0, B],$ (A.16)

then we say that in large samples $\hat{\theta}$ is asymptotically normally distributed
with

$\hat{\theta} \overset{a}{\sim} N[\theta_0, N^{-1}B],$ (A.17)

where the term “in large samples” means that $N$ is large enough for (A.16) to be
a good approximation but not so large that the variance in (A.17) goes to zero.

The result (A.17) follows from (A.16) since dividing a random variable by $\sqrt{N}$
leads to division of its variance by $N$.

A shorthand notation is to implicitly presume asymptotic normality and use the 
following terminology. 


Definition A.19 (Asymptotic Variance of $\hat{\theta}$): If (A.16) holds then we say that
the asymptotic variance matrix of $\hat{\theta}$ is

$\mathrm{V}[\hat{\theta}] = N^{-1}B.$ (A.18)


Definition A.20 (Estimated Asymptotic Variance of $\hat{\theta}$): If (A.16) holds then
we say that the estimated asymptotic variance matrix of $\hat{\theta}$ is

$\hat{\mathrm{V}}[\hat{\theta}] = N^{-1}\hat{B},$ (A.19)

where $\hat{B}$ is a consistent estimate of $B$.


Some authors use $\mathrm{Avar}[\hat{\theta}]$ and $\widehat{\mathrm{Avar}}[\hat{\theta}]$ in Definitions A.19
and A.20 to avoid potential confusion with the variance operator $\mathrm{V}[\cdot]$. It should be clear
that here $\mathrm{V}[\hat{\theta}]$ means asymptotic variance of an estimator since few estimators in this
book have closed-form expressions for the finite-sample variance.

As an example of Definitions A.18 to A.20, if $\{X_i\}$ are iid $[\mu, \sigma^2]$ then the
Lindeberg-Levy central limit theorem leads to
$\sqrt{N}(\bar{X}_N - \mu)/\sigma \xrightarrow{d} N[0, 1]$, or equivalently that
$\sqrt{N}(\bar{X}_N - \mu) \xrightarrow{d} N[0, \sigma^2]$. We say that asymptotically
$\bar{X}_N \overset{a}{\sim} N[\mu, \sigma^2/N]$; the asymptotic variance of $\bar{X}_N$ is $\sigma^2/N$;
and the estimated asymptotic variance of $\bar{X}_N$ is $s^2/N$, where $s^2$ is a consistent
estimator of $\sigma^2$ such as $s^2 = \sum_i (X_i - \bar{X}_N)^2/(N - 1)$.
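
As a minimal sketch of the sample-mean example, the code below computes $\bar{X}_N$, the estimated asymptotic variance $s^2/N$, and the implied standard error. The function name `mean_with_asymptotic_se` and the four data values are hypothetical, chosen only so that the arithmetic is easy to follow.

```python
import numpy as np


def mean_with_asymptotic_se(x):
    """Return (X_bar_N, estimated asymptotic variance s^2 / N, standard error)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xbar = x.mean()
    s2 = x.var(ddof=1)       # s^2 = sum_i (X_i - X_bar_N)^2 / (N - 1)
    avar_hat = s2 / n        # estimated asymptotic variance of X_bar_N
    return xbar, avar_hat, np.sqrt(avar_hat)


# Hypothetical data: deviations from the mean are -2, 0, 0, 2,
# so s^2 = 8/3 and s^2 / N = 2/3.
xbar, avar_hat, se = mean_with_asymptotic_se([2.0, 4.0, 4.0, 6.0])
print(xbar, avar_hat, se)
```

With only four observations the normal approximation would not be taken seriously in practice; the example is meant to show the computation, not to suggest a sample size.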


A.6.5. Asymptotic Efficiency 


In finite samples the Cramer-Rao lower bound for the variance-covariance matrix of
unbiased estimators is
$-\left( \mathrm{E}\left[ \partial^2 \ln L_N / \partial\theta\,\partial\theta' \,\big|_{\theta_0} \right] \right)^{-1}$.
This result extends to consistent estimators that are asymptotically normal.


Definition A.21 (Asymptotic Efficiency): A consistent asymptotically normal
estimator $\hat{\theta}$ of $\theta_0$ is said to be asymptotically efficient if it has an asymptotic
variance-covariance matrix equal to the Cramer-Rao lower bound.


A.7. Stochastic Order of Magnitude 


A useful notation for rates of convergence of sequences of variables is the order of 
magnitude of a sequence using (O, o) notation, or big-O, little-o notation. 

A sequence of nonstochastic real numbers $a_N$ is $O(g(N))$, if $\lim(a_N/g(N))$ is finite
nonzero, and is $o(g(N))$, if $\lim(a_N/g(N))$ is zero. Thus $a_N$ is $O(g(N))$ if it is of the
same order of magnitude as the function $g(N)$ and is $o(g(N))$ if it is of smaller order
of magnitude than $g(N)$. For example, $(3/N) + (5/N)^2$ is $O(1/N)$ or $O(N^{-1})$, as it
behaves for large $N$ like a constant times $N^{-1}$ and is $o(N^{-1/2})$ but larger than $o(N^{-1})$.
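
The orders of magnitude in this example can be checked numerically. The sketch below assumes the sequence is $a_N = 3/N + (5/N)^2$ (reading the second term as a squared ratio) and evaluates the two ratios that appear in the definitions; the function name `a` is just a label for the example.

```python
def a(N):
    """Illustrative sequence a_N = 3/N + (5/N)^2."""
    return 3.0 / N + (5.0 / N) ** 2


for N in (10, 1_000, 100_000):
    ratio_big_o = a(N) / N ** -1.0     # tends to 3: finite and nonzero, so O(1/N)
    ratio_little_o = a(N) / N ** -0.5  # tends to 0, so o(N^{-1/2})
    print(N, ratio_big_o, ratio_little_o)
```

The first ratio settles at a nonzero constant while the second shrinks toward zero, which is exactly the distinction between the $O(\cdot)$ and $o(\cdot)$ statements above.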

This notation has been extended to stochastic orders of magnitude of sequences
of random variables. The notation becomes $(O_p, o_p)$ notation.


Definition A.22 (Stochastic Order of Magnitude): A sequence of random variables
$b_N$ is $O_p(g(N))$ if

$0 < \operatorname{plim} \dfrac{b_N}{g(N)} < \infty$

and is $o_p(g(N))$ if

$\operatorname{plim} \dfrac{b_N}{g(N)} = 0.$


Most often $g(N) = N^{-c}$ for some constant $c > 0$. An estimator $\hat{\theta}$ consistent for
$\theta_0$ can be written as $\hat{\theta} = \theta_0 + o_p(1)$, since it equals $\theta_0$ plus a term that goes
to zero in probability. An estimator $\hat{\theta}$ that is additionally root-$N$ consistent for
$\theta_0$ can be written as $\hat{\theta} = \theta_0 + O_p(N^{-1/2})$, since then
$N^{1/2}(\hat{\theta} - \theta_0) = O_p(1)$.


A.8. Other Results 


This section contains some key finite sample results on conditional expectation and on 
the interchange of expectations and transformation. 


Theorem (Law of Iterated Expectations): For random variables $Y$ and $X$

$\mathrm{E}[Y] = \mathrm{E}_X[\mathrm{E}_{Y|X}[Y|X]],$

where $\mathrm{E}[\cdot]$ denotes the unconditional or marginal mean of $Y$, $\mathrm{E}_X[\cdot]$ denotes
unconditional expectation with respect to the marginal cdf of $X$, and $\mathrm{E}_{Y|X}[\cdot|X]$
denotes conditional expectation with respect to the conditional distribution of $Y$
given $X$.


This result means that if we first obtain the conditional mean of $Y$ given $X$, and
then take the expected value over $X$, we will obtain the unconditional mean of $Y$. See
Rao (1973, p. 97) for a proof. For example, if $\mathrm{E}[u|x] = 0$ then
$\mathrm{E}[u] = \mathrm{E}_x[\mathrm{E}[u|x]] = \mathrm{E}_x[0] = 0$.


Theorem (Decomposition of Variance): For random variables $Y$ and $X$

$\mathrm{V}[Y] = \mathrm{E}_X[\mathrm{V}_{Y|X}[Y|X]] + \mathrm{V}_X[\mathrm{E}_{Y|X}[Y|X]],$

where $\mathrm{V}[Y]$ denotes the unconditional variance of $Y$, $\mathrm{E}_X[\cdot]$ denotes
unconditional expectation with respect to the marginal cdf of $X$, $\mathrm{V}_{Y|X}[Y|X]$ denotes the
conditional variance of $Y$ given $X$, $\mathrm{V}_X[\cdot]$ denotes variance with respect to the
unconditional distribution of $X$, and $\mathrm{E}_{Y|X}[\cdot|X]$ denotes conditional expectation
with respect to the conditional distribution of $Y$ given $X$.

In words, the unconditional variance of $Y$ equals the sum of (1) the expected value
(over $X$) of the conditional variance and (2) the variance (over $X$) of the conditional
mean. A simple way to remember this is to recognize that the unconditional variance
equals EV plus VE. See Rao (1973, p. 97) for a proof.
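
Both the law of iterated expectations and the decomposition of variance can be verified exactly on a small discrete distribution. The joint probability mass function below is hypothetical, invented only for this check; the code computes each side of the two identities directly from it.

```python
import numpy as np

# Hypothetical joint pmf of (X, Y): rows index x in {0, 1}, columns index y in {0, 1, 2}.
p = np.array([[0.10, 0.20, 0.10],
              [0.05, 0.25, 0.30]])
y_vals = np.array([0.0, 1.0, 2.0])

p_x = p.sum(axis=1)                                  # marginal pmf of X
p_y_given_x = p / p_x[:, None]                       # conditional pmf of Y given X = x
cond_mean = p_y_given_x @ y_vals                     # E[Y | X = x]
cond_var = p_y_given_x @ y_vals**2 - cond_mean**2    # V[Y | X = x]

p_y = p.sum(axis=0)                                  # marginal pmf of Y
mean_y = p_y @ y_vals                                # E[Y]
var_y = p_y @ y_vals**2 - mean_y**2                  # V[Y]

lie = p_x @ cond_mean                                # E_X[ E[Y|X] ]
ev = p_x @ cond_var                                  # E_X[ V[Y|X] ]  ("EV")
ve = p_x @ cond_mean**2 - lie**2                     # V_X[ E[Y|X] ]  ("VE")

print("E[Y] and E_X[E[Y|X]]:", mean_y, lie)
print("V[Y] and EV + VE:    ", var_y, ev + ve)
```

The two printed pairs agree up to floating-point rounding, whatever valid joint pmf is supplied in `p`.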


Theorem (Jensen’s Inequality): If $Z$ is a random variable such that $\mathrm{E}[Z]$ exists,
and $g(\cdot)$ is a convex function, then

$g(\mathrm{E}[Z]) \le \mathrm{E}[g(Z)].$

If instead $g(\cdot)$ is a concave function then

$g(\mathrm{E}[Z]) \ge \mathrm{E}[g(Z)].$


This result, proved in Rao (1973, p. 58), is very important for nonlinear models.
It emphasizes the difference between behavior of the average individual and
average behavior. For example, suppose an exponential model is appropriate, with
$\mathrm{E}[y|x] = \exp(x'\beta)$. Then since the exponential function is convex, Jensen’s
Inequality implies that $\exp(\mathrm{E}[x'\beta]) \le \mathrm{E}[\exp(x'\beta)]$. The conditional mean
evaluated at the individual with average characteristics $x = \mathrm{E}[x]$ is therefore no
greater than the unconditional mean $\mathrm{E}[y] = \mathrm{E}[\mathrm{E}[y|x]] = \mathrm{E}[\exp(x'\beta)]$.
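
A quick numerical check of the exponential example is sketched below. Because $\exp(\cdot)$ is convex, Jensen's inequality gives $\exp(\mathrm{E}[x'\beta]) \le \mathrm{E}[\exp(x'\beta)]$. The scalar regressor, its normal distribution, the coefficient value, and the seed are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)                     # arbitrary seed (assumption)
beta = 0.5                                         # hypothetical slope coefficient
x = rng.normal(loc=1.0, scale=2.0, size=100_000)   # hypothetical scalar regressor

# Conditional mean evaluated at the average characteristics, exp(x_bar * beta).
mean_at_average_x = np.exp(beta * x.mean())
# Average of the conditional means, the sample analogue of E[exp(x * beta)].
average_of_means = np.exp(beta * x).mean()

print("exp(E[x] beta):", mean_at_average_x)
print("E[exp(x beta)]:", average_of_means)
```

For these assumed values the population quantities are $\exp(0.5)$ and $\exp(1)$, so the gap between the prediction for the average individual and the average prediction is far from negligible.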


A.9. Bibliographic Notes 


A classic source with proofs is Rao (1973, pp. 108-130), whom we cite wherever possible. The
results summarized also draw heavily on the books by Amemiya (1985, Chapter 3) and White
(2001a).

Graduate-level textbooks such as Greene (2003) provide summaries of key results. More
advanced texts by Davidson and MacKinnon (1993), Hendry (1995), Ruud (2000), and
Wooldridge (2002) provide treatments at least as detailed as that here. Davidson (1994)
provides a book-length treatment of stochastic theory for the econometrician. As already noted,
terminology can differ across references, especially in the use of Slutsky’s Theorem and
Cramer’s Theorem.


Making Pseudo-Random Draws


In this appendix we state the density or probability mass functions and first two moments
of leading univariate distributions and present methods to generate random draws
from these distributions.


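Table B.1. Continuous Random Variable Densities and Moments^a

Random Variable | pdf f(x) | Mean; Variance
Uniform $U[a,b]$ | $1/(b-a)$ | $(a+b)/2$; $(b-a)^2/12$
Normal $N[\mu,\sigma^2]$ | $\frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/(2\sigma^2)}$ | $\mu$; $\sigma^2$
Exponential $E[\lambda]$ | $\lambda e^{-\lambda x}$, $\lambda > 0$ | $1/\lambda$; $1/\lambda^2$
Gamma $G[a,b]$ | $\frac{b^a}{\Gamma(a)} x^{a-1} e^{-xb}$ | $a/b$; $a/b^2$
Beta $B[a,b]$ | $\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} x^{a-1}(1-x)^{b-1}$ | $\frac{a}{a+b}$; $\frac{ab}{(a+b)^2(a+b+1)}$
Logistic $L[a,b]$ | $e^{-(x-a)/b} / [b(1 + e^{-(x-a)/b})^2]$, $-\infty < a < \infty$ | $a$; $(b\pi)^2/3$
Chi-Square $\chi^2(n)$ | $\frac{x^{(n/2)-1} e^{-x/2}}{2^{n/2}\Gamma(n/2)}$ | $n$; $2n$
t $t(v)$ | $\frac{\Gamma((v+1)/2)}{\sqrt{v\pi}\,\Gamma(v/2)} \left(1 + \frac{x^2}{v}\right)^{-(v+1)/2}$ | $0$; $\frac{v}{v-2}$ for $v > 2$
F $F(w,v)$ | $\frac{\Gamma((w+v)/2)\,(w/v)^{w/2}\, x^{(w/2)-1}}{\Gamma(w/2)\Gamma(v/2)\,(1 + xw/v)^{(w+v)/2}}$ | $\frac{v}{v-2}$ for $v > 2$; $\frac{2v^2(w+v-2)}{w(v-4)(v-2)^2}$ for $v > 4$

^a All parameters are restricted as follows: $b > a$ for the Uniform; $\mu$ unrestricted, $\sigma^2 > 0$ for the
Normal; $\lambda > 0$ for the Exponential; $a, b > 0$ for the Gamma; $a, b > 0$ for the Beta; $a$ unrestricted and
$b > 0$ for the Logistic; $v$ is an integer for the t-distribution; for the F-distribution $v$ and $w$ must be integers.


Table B.2. Continuous Random Variable Generators

Random Variable | Range of Values | Random Variable Generator
Uniform $U[a,b]$ | $a < x < b$ | $x = a + (b-a)r$, $r \sim U[0,1]$
Normal $N[\mu,\sigma^2]$ | $-\infty < x_1, x_2 < \infty$ | $x_1 = \mu + \sigma\sqrt{-2\ln(r_1)}\cos(2\pi r_2)$; $x_2 = \mu + \sigma\sqrt{-2\ln(r_1)}\sin(2\pi r_2)$
    [$r_1, r_2 \sim U[0,1]$; the resulting pair $x_1$ and $x_2$ are independent random variables.]
Exponential $E[\lambda]$ | $0 < x < \infty$ | $x = -\frac{1}{\lambda}\ln(r)$
Gamma $G[a,b]$ | $0 < x < \infty$ | (i) $x = -\frac{1}{b}\ln\left(\prod_{i=1}^{a} r_i\right)$ or $x = \sum_{i=1}^{a} E_i$; (ii) $x = -\frac{1}{b}\left[\ln\left(\prod_{i=1}^{m} r_i\right) - y_1 y_2\right]$
    (i) $r_i \sim U[0,1]$; $a$ is integer. The $E_i$ are iid exponential random variates.
        When $a = 1$, we have an exponential random variable.
    (ii) $a$ is non-integer. $a = m + q$, $0 < q < 1$, $m$ = integer;
        $y_1, y_2$ are independent $B(q, 1-q)$ and $E(1)$.
Beta $B[a,b]$ | $0 < x < 1$ | (i) $x = y_1/(y_1 + y_2)$; (ii) $x = r_1^{1/a}/(r_1^{1/a} + r_2^{1/b})$, $(r_1^{1/a} + r_2^{1/b}) \le 1$
    (i) $a, b$ are integers. $y_1$ is $G(k, a)$, $y_2$ is $G(k, b)$. $k$ can be chosen arbitrarily.
    (ii) $a, b$ are non-integer. $r_i \sim U[0,1]$; successive pairs of $r_1$ and $r_2$ are
        generated until $(r_1^{1/a} + r_2^{1/b}) \le 1$.
Logistic $L[a,b]$ | $-\infty < x < \infty$ | $x = a + b\ln\left(\frac{r}{1-r}\right)$
    [$r \sim U[0,1]$]
Chi-Square $\chi^2(n)$ | $0 < x$ | $x = \sum_{i=1}^{n} y_i^2$
    [$n$ is an integer; the $y_i$ are independent $N(0,1)$.]
t $t(v)$ | $-\infty < x < \infty$ | $x = y_1/\sqrt{y_2/v}$
    [$y_1$ is $N(0,1)$; $y_2$, independent of $y_1$, is $\chi^2(v)$.]
F $F(w,v)$ | $0 < x$ | $x = (y_1/w)/(y_2/v)$
    [$y_1$ and $y_2$ are independent $\chi^2(w)$ and $\chi^2(v)$, respectively.]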
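
Several of the generators in Table B.2 can be sketched in a few lines using only uniform draws. The code below is an illustration under stated assumptions rather than a production implementation: the function names are hypothetical, Python's standard `random.Random` supplies the uniform draws, and `1 - r` is used in place of `r` inside logarithms (it has the same uniform distribution) so that the logarithm of zero cannot occur.

```python
import math
import random


def draw_uniform(a, b, rng):
    """Uniform on [a, b] by rescaling a U[0, 1) draw."""
    return a + (b - a) * rng.random()


def draw_exponential(lam, rng):
    """Inverse-transform method: x = -(1/lam) ln(r)."""
    return -math.log(1.0 - rng.random()) / lam


def draw_normal_pair(mu, sigma, rng):
    """Box-Muller transformation: two independent N[mu, sigma^2] draws."""
    r1 = 1.0 - rng.random()          # lies in (0, 1], so log(r1) is finite
    r2 = rng.random()
    radius = math.sqrt(-2.0 * math.log(r1))
    x1 = mu + sigma * radius * math.cos(2.0 * math.pi * r2)
    x2 = mu + sigma * radius * math.sin(2.0 * math.pi * r2)
    return x1, x2


def draw_logistic(a, b, rng):
    """Inverse-transform method: x = a + b ln(r / (1 - r))."""
    r = rng.random()
    while r == 0.0:                  # avoid log(0); essentially never triggered
        r = rng.random()
    return a + b * math.log(r / (1.0 - r))


def draw_chi_square(n, rng):
    """Sum of n squared independent N(0, 1) draws (n a positive integer)."""
    total = 0.0
    for _ in range(n):
        y, _unused = draw_normal_pair(0.0, 1.0, rng)
        total += y * y
    return total


rng = random.Random(42)              # arbitrary seed (assumption)
print(draw_exponential(2.0, rng), draw_normal_pair(0.0, 1.0, rng))
```

In applied work one would normally rely on a tested library generator; the value of writing these out is that they show how each distribution is built up from uniform draws.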
Table B.3. Discrete Random Variable Probability Mass Functions and Moments^a

Random Variable | pmf f(x) | Mean; Variance
Binomial $Bi[n,p]$ | $\binom{n}{x} p^x (1-p)^{n-x}$ | $np$; $np(1-p)$
Poisson $P[\lambda]$ | $e^{-\lambda}\lambda^x / x!$ | $\lambda$; $\lambda$
Negative binomial $NB[n,p]$ | $\binom{n+x-1}{x} p^n (1-p)^x$ | $\frac{n(1-p)}{p}$; $\frac{n(1-p)}{p^2}$

^a For the binomial $0 < p < 1$ and $n$ is a positive integer; for the Poisson $\lambda > 0$; and for the
negative binomial $0 < p < 1$, $n > 1$.


Table B.4. Discrete Random Variable Generators

Random Variable | Range of Values | Random Variable Generator

Binomial $Bi(n,p)$ | $x = 0, 1, \ldots, n$
    set x = 0
    do the loop n times
        generate r uniform on [0,1]
        if r < p, then x = x + 1
    output x

Poisson $P(\lambda)$ | $x = 0, 1, 2, \ldots$
    set x = 0; t = 0
    do the loop until t > λ
        generate exponential random variable y (with unit mean)
        set t = t + y
        if t ≤ λ, then x = x + 1
    output x

Negative binomial $NB(n,p)$ | $x = 0, 1, 2, \ldots$
    generate λ from G(n, a)
    generate x from P(λ)
    output x
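
The discrete generators in Table B.4 can be sketched as follows. The function names are hypothetical and Python's standard `random` module supplies the uniform and gamma draws. Two choices are assumptions of this sketch rather than statements from the table: the Poisson loop counts only the unit-mean exponential arrivals that fall at or before $\lambda$, and the gamma mixing distribution for the negative binomial is taken to have shape $n$ and scale $(1-p)/p$, which matches the mean $n(1-p)/p$ and variance $n(1-p)/p^2$ in Table B.3.

```python
import math
import random


def draw_binomial(n, p, rng):
    """Count successes in n independent Bernoulli(p) trials."""
    x = 0
    for _ in range(n):
        if rng.random() < p:
            x += 1
    return x


def draw_poisson(lam, rng):
    """Count unit-mean exponential arrivals occurring in the interval [0, lam]."""
    x, t = 0, 0.0
    while True:
        t += -math.log(1.0 - rng.random())   # unit-mean exponential waiting time
        if t > lam:
            return x
        x += 1


def draw_negative_binomial(n, p, rng):
    """Poisson-gamma mixture; gamma shape n and scale (1 - p) / p are assumed."""
    lam = rng.gammavariate(n, (1.0 - p) / p)
    return draw_poisson(lam, rng)


rng = random.Random(2005)                    # arbitrary seed (assumption)
print(draw_binomial(10, 0.3, rng),
      draw_poisson(4.0, rng),
      draw_negative_binomial(3.0, 0.4, rng))
```

The Poisson routine takes time proportional to $\lambda$, so for large $\lambda$ a library generator based on a different algorithm would be the more practical choice.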
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accelerated failure time (AFT) model, 591-2 
coefficient interpretation, 606-7 
definition, 592 
leading examples, 585 

accept-reject methods, 413-4, 445 

ACD. See average completed duration 

acronyms, 17 

AD estimator. See average derivative 

adaptive estimator, 323, 328, 684 

adding-up constraints, 210 

additive model, 323, 327, 523 

additive random utility model (ARUM) 
binary outcome models, 476-8 
generalized random utility models, 515-6 
identification, 504 
multinomial outcome models, 504—7 
nested logit model, 509, 526-7 
RPL model, 513 
welfare analysis in, 506-7 

admissible estimator, 435 

AFT. See accelerated failure time 

aggregated data 
binary outcomes, 480-2 
cohort-level, 772 
nonlinear models, 482, 487 
multinomial outcomes, 513 
time-aggregated durations, 578, 600-3 
see also discrete-time duration data 

AIC. See Akaike information criterion 

AID. See average interrupted duration 

Akaike information criterion (AIC), 278-9, 284, 624 

almost sure convergence, 947-8 

analog estimator, 135 

analogy principle, 135 
and method of moments estimators, 167 

analysis of covariance, 733 

analysis of variance, 733 

Anscombe residual, 289 


antithetic sampling, 408-9, 445 
applications with data 


competing risks models, 658-62 

duration models, 603-8, 632-6 

IV estimation, 110-2 

kernel regression, 295-7, 300 

logit and probit models, 464-6, 486 

multinomial and nested logit models, 491-5, 511 
Poisson and negative binomial models, 671-4, 690 
panel fixed and random effects estimation, 708-15 
panel GMM linear estimation, 754-6 

panel nonlinear estimation, 792-5 

quantile regression, 88—90 

selection and two-part models, 553-6, 565 
survival function, 574—5, 582 

treatment evaluation estimation, 889-96 

see also data sets used in applications 


Archimedean family, 654 
Arellano-Bond estimator, 765-6, 777 


application, 754-6 
nonlinear models, 791 
unit roots, 768 


ARMA. See autoregressive moving average 
artificial nesting, 283 

ARUM. See additive random utility model 
asymptotic distribution, 953-4 
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asymptotic efficiency, 954 

asymptotic normal distribution, 953 

definition, 74, 120, 953 

estimated asymptotic variance, 954 

of extremum estimators, 127-31 

of FGLS estimator, 82-3 

of FGNLS estimator, 156-7 

of first-differences estimator, 730-1 

of fixed effects estimator, 727-9 

of GMM estimator, 173-4, 182-3, 185-6, 194-5, 
745-6 

of Hausman test statistic, 271-4 
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of kernel density estimator, 301-2, 330-1 
of kernel regression estimator, 313, 331-3 
of LM test statistic, 235, 237-8 
of LR test statistic, 235, 237 
of m-estimators, 119-21 
of MD estimator, 292 
of ML estimator, 142-3 
of MM estimator, 134, 174 
of MSL estimator, 394—5 
of MSM estimator, 400-2 
of m-test statistics, 260, 263 
of NLS estimator, 152-4 
of NL2SLS estimator, 195-6 
of OIR test statistic, 181, 183 
of OLS estimator, 73-4, 80-1 
of panel GMM estimator, 745-6 
of quasi-ML estimator, 146 
of random effects estimator, 735 
of Wald test statistic, 226-8 
see also asymptotic theory 
asymptotic efficiency, 954 
of optimal GMM, 177 
asymptotic refinement, 359, 371-2 
by bootstrap, 256, 363-7, 371-2, 378-9 
definition, 359 
by Edgeworth expansion, 371-2 
by nested bootstrap, 374, 379 
asymptotic theory definitions, 943-55 
asymptotic distribution, 953 
asymptotic variance, 954 
central limit theorems, 949-52 
consistency, 945 
convergence in distribution, 948-9 
convergence in probability, 944-7 
laws of large numbers, 947-8 
limit distribution, 948 
limit variance, 952-3 
stochastic order of magnitude, 954 
summary of definitions and theorems, 944 
asymptotic variance, 74, 120, 954 
estimated asymptotic variance, 74, 954 
see also asymptotic distribution 


asymptotically pivotal statistic, 359-60, 363-4, 366, 


372, 374, 379-80 
ATE. See average treatment effect 
ATET. See average treatment effect on the treated 
attenuation bias, 903—5, 911, 915, 919-20 
attrition bias, 739, 800-1, 940 
augmented regression model, 429 
autocorrelation 


in panel model errors, 705-8, 714-5, 722-5, 745-6 


dynamic panel models, 763-8, 791-2, 797-9, 
806-7 
see also panel-robust inference 
autoregressive moving average (ARMA) errors 
definition, 159 
NLS estimator, 159 
panel data, 722-5, 729 


auxiliary model, 404 
auxiliary regression 
bootstrapping, 379, 382 
example, 241-3, 269-71 
Hausman test, 276, 718-9 
LM test, 240-1, 274 
m-test, 261-4, 544 
available case analysis. See pairwise deletion 
average completed duration (ACD), 626 
average derivative (AD) estimator 
definition, 326 
uses, 317, 483 
average interrupted duration (AID), 626 
average selection bias, 868 
average squared error, 315 
average treatment effect (ATE), 33-4, 866-71 
definition, 866 
difficulties estimating, 866 
local ATE, 883-6 
matching estimators, 871-8 
potential outcome model, 33-4 
selection on observables only, 868-9 
selection on unobservables, 868-71 
see also ATET; LATE; MTE 
average treatment effect on the treated (ATET), 
866-78 
application, 889-6 
definition, 866 
difficulties estimating, 866 
matching estimators, 871-8, 894-6 
selection on observables only, 868-9 
selection on unobservables, 868-71 
see also ATE; LATE; MTE 
averaged data. See aggregated data 


backward recurrence time, 626 
balanced bootstrap, 374 

balanced repeated replication, 855 
balancing condition, 864, 893-4 
bandwidth, 299, 307, 312 


bandwidth choice for kernel density estimator, 302-4 


cross validation, 304 
example, 296-7 
optimal, 303, 306 
Silverman’s plug-in estimate, 304 
bandwidth choice for kernel regression estimator, 
314-6 
cross validation, 314-6 
example, 297, 316 
optimal, 314, 318 
plug-in estimate, 314 
baseline hazard, 591 
in AFT model, 592 
identification in mixture models, 618—20 
in multiple spells models, 655-6 
in PH model, 591, 596-7, 601-2 
Bayes factors, 456-8 
Bayes rule. See Bayes theorem 
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Bayes theorem, 421 
example, 422-4, 435-9 
Bayesian central limit theorem, 433 
Bayesian information criterion (BIC), 278, 284 
see also AIC 
Bayesian methods, 419-59 
Bayes 1764 example, 458-9 
Bayesian approach, 420-35 
binary outcome models, 475 
compared to non-Bayesian, 164, 424-5, 432-41, 
439-41 
count models, 687 
data augmentation, 454-5, 932-3, 935-9 
decision analysis, 434-5 
examples, 452-4 
hierarchical linear model, 847 
importance sampling, 443-5 
linear regression, 435-43, 449-50, 452-4 
Markov chain Monte Carlo simulation, 445-54, 
935-9 
measurement error model, 915 
mixed linear model, 775 
model selection, 456-8 
multinomial outcome models, 514, 519 
panel data, 775, 809 
posterior distribution, 421, 430-4 
prior distribution, 425-30 
Tobit model, 563 
BCA method. See bias-corrected and accelerated 
before-after comparison 
application, 890-1 
Berkson error model, 920 
Berkson’s minimum chi-square estimator, 480-1 
Berndt, Hall, Hall, and Hausman (BHHH) estimate, 
138, 241, 395 
Berndt, Hall, Hall, and Hausman (BHHH) iterative 
method, 343-4 
Bernoulli distribution, 140, 148, 468, 475, 483 
Bernstein-von Mises Theorem, 433, 459 
best linear unbiased predictor, 738, 776 
between estimator, 702, 736, 841 
application, 710-3 
between-group variation, 709, 733 
between model, 702 
BFGS algorithm. See Boyden, Fletcher, Goldfarb, and 
Shannon 
BHHH estimate. See Berndt, Hall, Hall, and Hausman 
BHHH method. See Berndt, Hall, Hall, and Hausman 
bias-corrected and accelerated (BCA) bootstrap 
method, 360 
biased sampling, 42-5, 626-7 
see also sample selection; endogenous stratification 
BIC. See Bayesian information criterion 
binary endogenous variable, 562 
binary outcome models, 463-89 
additive random utility model, 476-8 
aggregated data, 480-2 
alternative-invariant regressors, 478 


alternative-varying regressors, 478 
choice-based samples, 478-9 
corrected score estimator, 916-8 
definition, 466 
example, 464-5 
identification, 476, 483 
index function model, 475-6 
marginal effects, 467, 470-1 
measurement error in dependent variable, 914 
measurement error in regressors, 919 
ML estimator, 468-9 
model misspecification, 472 
multiple imputation example, 937-8 
OLS estimator, 471 
panel data, 795-9 
semiparametric estimation, 482-6 
see also logit models; probit models 
binding function, 404—5 
bivariate counts, 215, 685-7 
bivariate negative binomial distribution, 686-7 
bivariate ordered probit model, 523 
bivariate Poisson distribution, 686 
bivariate Poisson-lognormal mixture, 686 
bivariate probit model, 522-3 
bivariate sample selection model, 547-53 
application, 553-5 
bounds, 566 
conditional mean, 548—50 
conditional variance, 549-50 
definition, 547 
Heckman two-step estimator, 550-1 
identification, 551, 565-6 
marginal effects, 552 
ML estimator, 548 
outcome equation, 547 
participation equation, 547 
semiparametric estimator, 565-6 
versus two-part model, 546, 552-3 
Bonferroni test, 230 
bootstrap hypothesis tests 
asymptotic refinement, 363-4, 366-7, 371-2, 
378-9 
bootstrap critical value, 256, 363 
bootstrap p-value, 256, 363 
example, 366-8 
nonsymmetrical test, 363, 380 
power, 372-3 
symmetrical test, 363 
without asymptotic refinement, 363, 367-8, 
378 
bootstrap methods, 357-83 
asymptotic refinement, 359, 366-7 
bias estimate, 365 
bias-corrected estimator, 365, 368 
clustered data, 363, 377-8, 845 
confidence intervals, 364-5, 368 
consistency, 369-70 
critical value, 363
examples, 254-6, 366-8
for functions of parameters, 363
general algorithm, 360
for GMM, 379-80
heteroskedastic data, 363, 376-7
introduction, 254-6
for nonsmooth estimators, 373, 380-1
number of bootstrap samples, 361-2
panel data, 363, 377-8, 708, 746, 751
p-value, 363
recentering, 374, 379
rescaling, 374
sampling methods for, 360
smoothness requirements, 370
standard error estimate, 362, 366
time series data, 381
variance estimate, 362
without asymptotic refinement, 358, 367-8
see also bootstrap hypothesis tests
bounds identification, 29
in measurement error models, 906-8
bounds in selection model, 566
Broyden, Fletcher, Goldfarb, and Shanno (BFGS)
algorithm, 344


CAIC. See consistent Akaike information criterion
calibrated bootstrap, 374
caliper matching, 874, 895
canonical link function, 149, 469, 783
case-control analysis, 479, 823
causality, 18-38
examples, 69-70, 98
Granger causality, 22
identification frameworks and strategies,
35-7
in linear regression model, 68-9
in potential outcome models, 32-4, 862-5
in simultaneous equations model, 26-7
in single-equation model, 31
and weighting, 820-1
see also endogeneity
cdf. See cumulative distribution function
censored least absolute deviations (CLAD) estimator,
564-5, 808
censored models, 530-44, 579-80
conditional mean, 535
count models, 680
definitions, 532, 579-80
examples, 530-1, 535
ML estimator, 533-4
semiparametric estimation, 563-5
see also duration model; selection models; Tobit
models; truncated models
censored normal regression model. See Tobit model
censoring mechanisms, 532, 579-80
censoring from above, 532, 579
censoring from below, 532, 579
left censoring, 532, 579, 588
independent censoring, 580
interval censoring, 579, 588
noninformative censoring, 580
random censoring, 579
right censoring, 532, 579, 581, 589
sample selection, 44-5, 547
type 1 censoring, 579
type 2 censoring, 580
census coefficient, 819
central limit theorem (CLT), 949-52
Cramer linear transformation, 952
Cramer-Wold device, 951
definition, 950
examples of use, 80, 130
Liapounov CLT, 950
Lindeberg-Levy CLT, 950
multivariate, 951-2
sample average, 949
sampling scheme, 131, 950
CGF tests. See chi-square goodness-of-fit
characteristic function, 370, 913, 950
chatter, 394, 410
Chebychev’s inequality, 946
chi-square goodness-of-fit (CGF) tests, 266-7, 270-1,
474
choice-based samples, 823
binary outcome models, 478-9
see also endogenous stratification
Choleski decomposition, 416, 448
CL model. See conditional logit
CLAD estimator. See censored least absolute
deviations
Clayton copula, 654
CLT. See central limit theorem
clustered data, 829-53
application, 848-53
cluster bootstrap, 363, 377-8, 845
cluster-robust inference, 707, 834, 842,
845
cluster sampling, 41-2
cluster-specific effects, 830-2, 837-45
comparison to panel data, 831-2
diagnostic tests, 841
dummy variables model, 840
fixed effects estimator, 840-1, 843-5
hierarchical models, 845-8
large clusters, 832
nonlinear models, 841-5
OLS estimator, 75, 833-7
quasi-ML estimator, 150
random effects estimator, 837-9, 843
small clusters, 832
see also panel data
cluster-robust standard errors
bootstrap, 363, 377-8, 845
clustered data, 834, 842
panel data, 706-7, 745-6, 789
see also robust standard errors
cluster-specific fixed effects (CSFE) estimator,
839-41, 843-4
application, 848-53 
between estimator, 840-1 
nonlinear models, 843-4 
within estimator, 140-1 
cluster-specific fixed effects (CSFE) model, 831, 843 
cluster-specific random effects (CSRE) estimator, 
837-9, 843-4 
application, 848-53 
cluster-specific random effects (CSRE) model, 831, 
843-4 
cluster variable, 707 
CM tests. See conditional moment 
coefficient interpretation 
in binary outcome models, 467, 473 
in competing risks model, 646 
in count model, 669 
in duration models, 606-7 
in misspecified linear model, 91-2 
in multinomial outcome models, 493-4, 501-3 
in nonlinear models, 122-4, 162-3 
in Tobit model, 541-2 
see also marginal effects 
coherency condition, 562 
cohort-level data. See pseudo panels 
cointegration, 382, 767 
common parameters, 801 
compensating variation, 500-7, 512 
competing risks model (CRM), 642-8, 658-62 
application, 658-62 
censoring, 642 
coefficient interpretation, 646 
definitions, 642-4 
dependent risks, 647-8 
exit route, 643 
identification, 646 
independent risks, 644-6 
ML estimator, 644-5
proportional hazards, 645-6 
spell duration, 643 
with unobserved heterogeneity, 647, 659 
complementary log-log model, 466-7, 603 
complete case analysis. See listwise deletion 
complex surveys, 41-2, 814-6, 853-6 
composition methods, 415 
computational difficulties, 350-2 
concentration parameter, 109 
conditional analysis, 717 
conditional expectations, 955-6 
conditional independence assumption, 23, 863, 865 
definition, 863 
for participation, 863 
given propensity score, 865 
selection on observables only, 868 
unconfoundedness, 863 
conditional likelihood, 139-40, 824 
panel models, 731-2, 782-3, 796-9, 805
conditional logit (CL) model, 500-3, 524-5
application, 491-4 
definition, 500 
fixed effects binary logit, 797, 844 
marginal effects, 493, 501-3, 525 
ML estimator, 501 
from ARUM, 505 
see also multinomial outcome models 
conditional ML estimator, 731-2, 782-3, 796-9, 805, 
824 
conditional moment (CM) tests, 264-5, 267-9, 319 
consistent CM test, 268 
in duration models, 632 
example, 269-71 
in Tobit model, 544 
see also m-tests 
conditional mean 
squared error loss, 67-9 
conditional mode 
step loss, 68 
condition number, 350 
conditional quantile 
asymmetric absolute loss, 68 
confidence intervals, 231-2, 316, 364-5, 368 
consistent Akaike information criterion (CAIC), 278 
consistent test statistic, 248 
consistency 
definition, 945 
of extremum estimators, 125-7, 132-3 
of GMM estimator, 173-4, 182 
of m-estimator, 132-3 
of ML estimator, 142, 146-50 
of NLS estimator, 155 
of OLS estimator, 73, 80 
strong consistency, 947 
weak consistency, 947 
see also asymptotic distribution; identification; 
pseudo-true value 
constant coefficients model. See pooled model 
contagion, 612 
contamination bias, 903-4 
contemporaneous exogeneity assumption, 748-9, 752, 
781 
continuous mapping theorem, 949 
control function approach, 37 
control function estimator, 869-70, 890 
control group, 49 
conventions, 16-17 
convergence criteria, 339-40, 458 
convergence in distribution, 948-9 
continuous mapping theorem, 949 
definition, 948 
limit distribution, 948 
transformation theorem, 949 
vector random variables, 949 
see also central limit theorem 
convergence in probability, 944-7 
alternative modes of convergence, 945
consistency, 945
definition, 945
probability limit, 945
Slutsky’s theorem, 945
uniform convergence, 126, 301
vector random variables, 945
see also law of large numbers
copulas, 216, 651-5
count example, 687
definition, 651-2
dependence parameter, 653-4
leading examples, 654
ML estimator, 655
survival copulas, 652
correlated random effects model, 719, 786
counterfactual, 32, 555, 861, 871
see also potential outcome model
count data, 665
examples, 665
heteroskedasticity, 665
right-skewness, 665
see also count models
count models, 665-93
censored, 680
application, 671-4, 690
endogenous regressors, 683, 687-9
endogenous sampling, 823
finite mixture models, 678-9
hurdle models, 680-1
measurement error in dependent variable, 915
measurement error in regressors, 915-8
mixture models, 675-7
multivariate, 685-7
OLS estimator, 684
negative binomial model, 675-7
NLS estimator, 684
panel data, 792-5, 802-8
Poisson model, 666-74
sample selection, 680
semiparametric regression, 684-5
truncated, 679-80
zero-inflated, 681
covariance matrix. See variance matrix 
covariance structures, 177, 379, 753, 766-7 
covariates. See regressors 
Cox CRM model. See competing risks 
Cox PH model. See proportional hazards 
Cox-Snell residual, 289, 631, 633-6 
CPS. See Current Population Survey 
Cramer linear transformation, 952 
Cramer-Rao lower bound, 143, 954
see also semiparametric efficiency bound
Cramer’s theorem, 949 
Cramer-Wold device, 130, 951 
CRM. See competing risks model 
cross-equation parameter restrictions, 210 
cross-section data, 47 
cross-validation, 304, 314-6, 318, 321
CSFE estimator. See cluster-specific fixed effects
CSRE. See cluster-specific random effects 
cumulant, 370 
cumulative distribution function (cdf), 576 
cumulative hazard function 
definition, 577-8 
in competing risks model, 644-5 
as diagnostic tool, 631-2 
in likelihood function, 588 
Nelson-Aalen estimator, 582-4, 605-6, 662
in proportional hazards model, 590 
Current Population Survey (CPS), 58, 814-5 
curse of dimensionality 
in Bayesian methods, 419-20 
multivariate kernel density estimator, 306 
multivariate kernel regression estimator, 319 
high-dimensional integrals, 393 


data augmentation, 454-5, 932 
imputation step, 455, 932 
for missing data, 932-8 
prediction step, 455, 933 
regression example, 933 
data-generating process (dgp), 72-3, 124 
misspecified, 90, 132 
data mining, 285-6 
data sets. See microdata 
data sets used in applications 
Current Population Survey Displaced Workers 
Supplement (McCall), 603-8, 632-6, 658-62 
fishing-mode choice data (Kling and Herriges), 
463-6, 486, 491-5 
National Longitudinal Survey (Kling), 110-2 
National Supported Work demonstration project 
(Dehejia and Wahba), 889-95 
Panel Survey of Income Dynamics cross-section 
sample, 295-7, 300 
Panel Survey of Income Dynamics panel sample 
(Ziliak), 708-15, 754-6 
patents-R&D panel data (Hausman, Hall, and 
Griliches), 792-5 
Rand Health Insurance Experiment expenditures, 
553-6, 565 
Rand Health Insurance Experiment medical doctor 
contacts, 671-4, 692 
strike duration data (Kennan), 574-5, 582 
Vietnam World Bank Living Standards Survey,
88-90, 848-53
see also applications with data 
data structures, 39-62 
data sources, 58-9 
handling microdata, 59-61 
natural experiments, 54-8 
observational data, 40-8 
social experiments, 48-54 
data summary approach to regression, 820 
Davidon, Fletcher, and Powell (DFP) algorithm, 344,
350-1
decomposition of variance, 955-6
degenerate distribution, 948 
degrees-of-freedom adjustment, 75, 102, 138, 185-6, 
278, 841 
delta method, 231-2 
bootstrap alternative, 363 
density kernel, 421 
density-weighted average derivative (DWAD) 
estimator, 326 
dependent variable, 71 
descriptive approach to regression, 820 
deviance, 149, 244 
deviance residual, 289, 291 
DFP algorithm. See Davidon, Fletcher, and Powell 
algorithm 
dgp. See data-generating process 
diagnostic tests. See specification tests 
DID estimator. See differences-in-differences 
differences-in-differences (DID) estimator, 55-7, 
768-70, 878-9 
application, 890-1 
consistency, 770 
definition, 768 
introduction, 55-7
natural experiments, 878 
with controls, 878-9 
without controls, 878 
direct regression, 906 
disaggregated data 
contrasted with aggregated data, 5-10 
discrete factor models, 678 
see also finite mixture models 
discrete outcomes. See binary outcomes; counts; 
multinomial outcomes 
discrete-time duration data, 577-8, 600-3 
cumulative hazard function, 578 
discrete-time proportional hazards, 600-3 
gamma heterogeneity, 620 
hazard function, 578 
logit model, 602 
ML estimator, 601 
nonparametric estimation, 581-4 
probit model, 602 
survivor function, 578 
dissimilarity parameter, 509 
disturbance term. See error term 
double bootstrap, 374 
dummy endogenous variable model, 557 
dummy variable estimator, 784-5, 800, 805, 840 
see also LSDV estimator 
duration data, 573-664 
different types, 626, 641 
duration models, 573-664
accelerated failure time, 591-2 
applications, 574-5, 583, 589, 603-8, 632-6, 
658-62 
censoring, 579-82, 587-9, 595, 642 
competing risks, 642-8, 658-62
cumulative hazard function, 577-8
discrete time, 577-8, 600-3
generalized residual, 631 
hazard function, 576, 578 
key concepts, 576-8 
mixture models, 613-25 
ML estimator, 587-9 
multiple spells, 655-8 
multivariate, 648-55 
nonparametric estimators, 580-4 
OLS estimator, 590-1 
panel data, 801-2 
parametric models, 584-91 
proportional hazards, 592-7 
risk set, 581, 594 
semiparametric estimation, 594-600, 610-2
specification tests, 628-32 
survivor function, 576, 578 
time-varying regressors, 597-600
see also proportional hazards model 
DWAD estimator. See density-weighted average 
derivative 
dynamic panel models, 763-8, 791-2, 797-9, 
806-7 
Arellano-Bond estimator, 765-6 
binary outcome models, 806-7 
count models, 806-7 
covariance structures, 766-7
inconsistency of standard estimators, 764-5
initial conditions, 764-5
IV estimators, 764-5
linear models, 763-8
MD estimator, 767
nonlinear models, 791-2, 797-9, 806-7
nonstationary data, 767-8 
transformed ML estimator, 766 
true state dependence, 763-4 
unobserved heterogeneity, 764 
weak exogeneity, 749 


EDF bootstrap. See empirical distribution function 
bootstrap 
Edgeworth expansions, 370-1 
efficient score, 141 
Eicker-White robust standard errors, 74-5, 80-1, 112, 
137, 164, 175 
see also heteroskedasticity-robust standard errors
EM algorithm. See expectation maximization
empirical Bayes method, 442 
empirical distribution function (EDF) bootstrap, 360 
see also paired bootstrap 
empirical likelihood, 203-6 
empirical likelihood bootstrap, 379-80 
encompassing principle, 283 
endogeneity 
definition, 92 
due to endogenous stratification, 78, 824-5 
Hausman test for, 271-2, 275-6
identification frameworks and strategies, 35-7
see also endogenous regressors; exogeneity 
endogenous regressors, 78 
binary, 557, 562 
in count models, 683-4, 687-9 
in discrete outcome models, 473 
in duration models, 598 
dummy, 557, 562 
inconsistency of OLS, 95-6 
in linear panel models, 744-63 
in linear simultaneous equations model, 23-30 
in nonlinear panel models, 792 
in potential outcome model, 30-3 
returns-to-schooling example, 69-70 
in selection models, 559-62 
in single-equation models, 30 
see also GMM estimator; IV estimator 
endogenous sampling, 42-5, 78, 822-9, 856 
consistent estimation, 827-9 
leading examples, 823 
see also censored models; endogenous 
stratification; sample selection models 
endogenous stratification, 820, 826-7, 856 
equation-by-equation OLS, 210 
equicorrelated errors, 701, 722-4, 804 
equidispersion, 668, 670 
error components model. See RE model 
error components SEM, 762 
error components SUR model, 762 
error components 2SLS estimator, 760 
error components 3SLS estimator, 762 
error term, 71, 168 
additive, 168 
nonadditive, 168 
errors-in-variables. See measurement error 
estimated asymptotic variance, 954 
see also asymptotic distribution 
estimated prediction error. See cross-validation 
estimating equations estimator, 13-5 
asymptotic distribution, 134-5, 174 
clustered data, 842 
computation, 339 
definition, 134 
generalized, 134, 790, 794, 804 
variance matrix estimation, 137-9 
weighted, 829 
see also MM estimator 
Euler conditions, 171, 749 
exact identification. See just identification 
exchangeable errors, 701, 804 
exhaustive sampling, 815-6 
exogeneity, 22-3 
conditional independence, 23 
Granger causality, 22 
of instrument, 106 
overidentifying restrictions test for, 277 
panel data assumptions, 700, 748-52, 754,
781
strong exogeneity, 22
weak exogeneity, 22
exogenous sampling, 42-3
exogenous stratified sampling, 42, 78, 814-5, 820,
825, 856
exogenous regressor. See exogeneity
expectation maximization (EM) algorithm, 345-7
for data imputation, 930-2
E (Expectation) step, 346
for finite mixture model, 623-5
M (Maximization) step, 346
compared to NR algorithm, 625
expected elapsed duration, 626
experimental data, 48-58
control group, 49
natural experiments, 54-8
social experiments, 48-54
treatment group, 49
explanatory variables. See regressors
exponential conditional mean, 124, 155, 669
coefficient interpretation, 124, 162-3, 669
exponential distribution, 140, 584-6
for generalized (Cox-Snell) residual, 631
exponential family density, 427
conjugate prior for, 427-8
see also linear exponential family
exponential-gamma regression model, 616,
633-4
exponential-IG regression model, 634
exponential regression model
application with censored data, 606-8, 633
example with uncensored data, 159-63
extreme value distribution. See type 1 extreme value
extremum estimator, 124-39
asymptotic distribution, 127-31
consistency, 125-7
definition, 125
formal proofs, 130-2
informal approach, 132-3
statistical inference, 135-9
variance matrix estimation, 137-9


factor analysis, 650 
factor loadings, 517, 650-1, 689 
factor model, 517, 648, 686 
Farlie-Gumbel-Morgenstern copula, 654
fast simulated annealing (FSA) method, 347-8 
FD estimator. See first-differences 
FE estimator. See fixed effects 
feasible generalized least squares (FGLS) estimator, 
81-3 
asymptotic distribution, 82 
definition, 82 
example, 84-5 
in fixed effects model, 729 
in mixed linear model, 775 
nonlinear, 155-8 
in pooled model, 720-1
in random effects model, 705, 734-6, 738, 837-9,
849-51 
as sequential two-step m-estimator, 201 
systems FGLS, 208-9 
feasible generalized nonlinear least squares (FGNLS) 
estimator, 155-8 
asymptotic distribution, 156 
definition, 156 
example, 159-63 
as optimal GMM estimator, 180-1 
systems FGNLS, 217 
FGLS estimator. See feasible generalized least squares 
FGNLS estimator. See feasible generalized nonlinear 
least squares 
FIML estimator. See full information maximum 
likelihood 
finite mixture models, 621-5
counts, 678-9 
definition, 622 
EM algorithm, 623-5 
latent class interpretation, 623 
number of components, 624-5 
panel data, 786 
see also mixture models 
finite-sample bias 
of GMM estimator, 177 
of IV estimator, 108-12 
of tests, 250-4, 262 
finite-sample correction term 
for sampling without replacement, 817 
first-differences (FD) estimator, 704-5, 729-31 
application, 710-11, 714 
asymptotic distribution, 730-1 
compared to FE estimator, 731 
consistency, 730, 764 
definition, 704-5, 730
IV estimator, 758 
first-differences (FD) model, 704, 729-31, 758 
first-differences (FD) transformation, 783-4 
fixed effects (FE) estimator, 704, 726-9, 756-9, 
781-5, 791-2 
application, 710-3, 792-5 
asymptotic distribution, 727-9 
binary outcome models, 796-9 
clustered data, 839-41 
compared to DID estimator, 768 
compared to FD estimator, 729 
as conditional ML estimator, 732 
consistency, 727, 764, 781-2, 784-5 
count models, 802-8 
definition, 704, 726, 781-4 
duration models, 802 
dynamic models, 764-6, 791-2, 797-9, 806-7 
as FGLS estimator, 729 
Hausman test for, 717-9 
identification, 702 
incidental parameters, 704, 726
inconsistency, 764, 781-2, 784-5
IV estimators, 758 
as LSDV estimator, 733 
multinomial outcome models, 798 
selection models, 801 
Tobit model, 800 
versus random effects, 701-2, 715-9, 788
fixed effects (FE) model, 704, 726-33, 756-9, 781-5, 
791-2 
cohort-level, 772 
clustered data, 831, 843 
definition, 700, 726 
dynamic models, 764-6, 791-2, 797-9, 806-7 
endogenous regressors, 756-9 
identification, 702 
incidental parameters, 704, 726 
marginal effects, 702 
nonlinear models, 781-5, 796-808, 791
time-varying regressors, 702
versus random effects, 701-2, 715-9, 788
see also fixed effects estimators 
fixed coefficient, 846 
fixed design. See fixed in repeated samples 
fixed in repeated samples, 76-7 
bootstrap sampling method, 360 
in kernel regression, 312 
Liapounov CLT, 951 
Markov LLN, 948 
Monte Carlo sampling method, 251 
fixed regressors. See fixed in repeated samples 
flexible parametric models 
count models, 674-5
hazard models, 592 
selection models, 563 
flow sampling, 44, 626 
forward orthogonal deviations IV estimator, 759 
forward orthogonal deviations model, 759 
forward recurrence time, 626 
Fourier flexible functional form, 321 
frailty, 612, 662 
see also unobserved heterogeneity 
Frank copula, 654 
Frechet bounds, 653-4 
frequentist approach, 421-2, 424, 439-40 
FSA method. See fast simulated annealing 
full conditional distributions, 431 
see also Gibbs sampler 
full information maximum likelihood (FIML) 
estimator, 214 
nested logit model, 510-2 
nonlinear models, 219 
functional approach 
to measurement error, 901 
functional form misspecification, 91-2 
diagnostics for, 272-3, 277-8 


gamma distribution, 585-6, 614 
gamma function, 586
Gaussian quadrature, 389-90, 393, 809
Gauss-Hermite quadrature, 389-90
Gauss-Laguerre quadrature, 389-90
Gauss-Legendre quadrature, 389-90
Gauss-Newton (GN) algorithm, 345
example, 348
GEE estimator. See generalized estimating equations
general to specific tests, 285
generalized additive model, 323, 327
generalized cross-validation, 315
generalized estimating equations (GEE) estimator,
790, 794, 804, 809
generalized extreme value (GEV) distribution, 508
see also nested logit model
generalized information matrix equality, 142, 145, 264
generalized inverse, 261
generalized IV estimator, 187
generalized least squares (GLS) estimator, 81-5
asymptotic distribution, 82
definition, 82
as efficient GMM, 179
example, 84-5
nonlinear, 155-8
generalized linear models (GLMs), 149-50, 155
count data, 683
conditional ML estimator, 783
GEE estimator, 791
quasi-ML estimator, 149-50
see also LEF models
generalized method of moments (GMM) estimator,
166-222
asymptotic distribution, 173-4, 182-3
based on additional moment restrictions, 169,
178-9
based on moment conditions from economic theory,
171
based on optimal conditional moment, 179-80
bootstrap for, 379-80
computation, 339
definition, 173
endogenous counts, 683-4, 687-9
with endogenous stratification, 827
with exogenous stratification, 823-4
examples, 167-71, 178-9
finite-sample bias, 177
identification, 173, 182
linear IV, 183-92
linear systems, 211-2
nonlinear IV, 192-9
one-step GMM estimator, 187, 196, 746, 755
optimal GMM, 176
optimal moment condition, 179-81, 188
optimal weighting matrix, 175-6
panel data, 744-66, 789-90, 792
practical considerations, 219-20
test based on, 245
two-step, 176, 187, 746, 755
variance matrix estimation, 174-5
weak instruments, 177-8
see also panel GMM estimator
generalized nonlinear least squares (GNLS) estimator. 
See feasible generalized nonlinear least squares 
generalized partially linear model, 323 
generalized random utility models, 515-6 
generalized residual, 289-90 
in duration models, 631 
in LM test, 239-40 
plots of, 633-6 
generalized Tobit model, 548 
generalized Weibull distribution, 584-6 
genetic algorithms, 341 
GEV distribution. See generalized extreme value 
Geweke, Hajivassiliou, Keane (GHK) simulator, 
407-8 
for MNP model, 518 
GHK simulator. See Geweke, Hajivassiliou, Keane 
simulator 
Gibbs sampler, 448-50
data augmentation, 454-5, 933 
example, 452-4 
in latent variable models, 514, 519, 563 
see also Markov chain Monte Carlo 
GLMs. See generalized linear models 
GLS estimator. See generalized least squares 
GMM estimator. See generalized method of moments 
GN algorithm. See Gauss-Newton 
GNLS estimator. See feasible generalized nonlinear 
least squares 
Gompertz distribution, 585-6 
Gompertz regression model, 606-8 
gradient methods, 337-48 
see also iterative methods 
Granger causality, 22 
grid search methods, 337, 351 
grouped data. See aggregated data 


Halton sequences, 409-10 
Hausman test, 271-4 
applications, 719, 850-1 
asymptotic distribution, 272 
auxiliary regressions, 273 
bootstrap, 378 
computation, 272-3, 378, 717-9 
definition, 271-2 
for endogeneity, 271-2, 275-6 
for fixed effects, 717-9, 737, 788, 839
for multinomial logit model, 503 
power, 273-4 
robust versions, 273, 378, 718-9 
Hausman-Taylor IV estimator, 761 
Hausman-Taylor model, 760-2 
Hawthorne effect, 53 
hazard function 
baseline in PH model, 591 
cumulative hazard, 577-8, 582-4 
definition, 576, 578
in mixture models, 616-8
multivariate, 649
nonparametric estimator, 581, 583
parametric examples, 585
piecewise constant, 591
see also duration models
Health and Retirement Study (HRS), 58
Heckit estimator. See Heckman two-step estimator
Heckman two-step estimator
application, 554
in Roy model, 556
in selection model, 550-1
semiparametric estimator, 565-6
in Tobit model, 543, 567-8
Hessian matrix
estimate, 137
Newton-Raphson algorithm, 341-2
singular, 350-1
heterogeneous treatment effects, 882, 885-7
IV estimator, 886-7
LATE estimator, 885
RD design, 882
heterogeneity
within-cell, 480
see also unobserved heterogeneity
heteroskedastic errors
adaptive estimation, 323, 328
conditional heteroskedasticity, 78
definition, 78
in GLMs, 149-50
in linear model, 84-5, 94-5
multiplicative, 84-5, 86-7
in nonlinear model, 157-63
residuals, 289-90
tests for, 241, 267, 275
Tobit MLE inconsistency, 538
working matrix for, 82-3, 156-8
heteroskedasticity-robust standard errors
bootstrap, 379-80
clustered data, 834
example, 84-5
for extremum estimator, 137, 164
intuition, 81
for NLS estimator, 155, 164
for OLS estimator, 74-5, 80-1, 112
panel data, 705
for WLS estimator, 83
see also robust standard errors
hierarchical linear models (HLMs), 845-8
Bayesian analysis, 847
clustered data, 845
coefficient types, 846-7
individual-specific effects, 848
mixed linear models, 774-6, 847
panel data, 847-8
random coefficients model, 847
two-level model, 846
hierarchical models, 429
Bayesian analysis, 441-2, 447, 450, 514
see also hierarchical linear models
histogram, 298
see also kernel density estimator
HLM. See hierarchical linear model
hot deck imputation, 929, 940
HRS. See Health and Retirement Study
Huber-White robust standard errors, 137, 144, 146
see also robust standard errors
hurdle model, 680-1, 690
see also two-part model
hyperparameters, 428, 847
hypothesis tests, 223-58
based on extremum estimator, 224-33
based on ML estimator, 233-43
based on GMM estimator, 245
based on m-estimator, 244
bootstrap, 254-6, 363-8, 372-3, 378-9
for common misspecifications, 274-7, 670-1
examples, 236, 241-3, 252-4, 254-6, 372-3
induced test, 230
joint versus separate, 230-1, 285, 629-30
power, 247-50, 253-4
size, 246-7, 251-3
see also LM tests; LR test; Wald tests; m-tests


identification
in additive random utility models, 504
in binary outcome models, 476, 483
bounds identification, 29
definitions, 29-31
in fixed effects model, 702
of GMM estimator, 173, 182
just identification, 31, 214
in linear regression model, 71-2
in measurement error models, 905-14
in mixture models, 618-20
in multinomial probit model, 517
in natural experiments, 57-8
observational equivalence, 29
order condition, 31, 213
over identification, 31, 214
rank condition, 31
in sample selection model, 551, 565, 566
set identification, 29
in simultaneous equations model, 29-31, 213-4
in single-index models, 325
and singular Hessian, 351
weak identification, 100
see also identification strategies
identification strategies, 36-7
control function approach, 37
exogenization, 36
incidental parameter elimination, 36-7
instrumental variables, 37
matching, 37
reweighting, 37
identified reduced form, 36
IG distribution. See inverse-Gaussian
ignorable missingness, 927
estimator consistency if MCAR, 927
estimator inconsistency if MAR only, 927
problems if nonignorable, 940
weak exogeneity, 927
ignorability assumption, 863
see also conditional independence assumption
importance sampling, 407-8, 443-5, 518
accelerated, 409
GHK simulator, 407-8
importance sampling density, 444
importance sampling estimator, 444
importance weight, 445
target density, 444
imputation methods, 928-39
data augmentation, 454-5, 932-4
example, 936-8
hot deck imputation, 929
listwise deletion, 928
mean imputation, 928-9
multiple imputation, 934-5
pairwise deletion, 928
regression-based imputation, 930-2
imputation (I) step, 455, 932
IM test. See information matrix test
IMSE. See integrated mean squared error
incidental parameters, 36
clustered data FE model, 832, 840, 844
panel data FE model, 704, 726, 781-2, 805
inclusive value, 510-1
incomplete gamma function, 586
incomplete panels. See unbalanced panels
independence of irrelevant alternatives, 503, 505, 527
independent variables. See regressors
independently-weighted IV estimator, 192
independently-weighted optimal GMM estimator, 177
index function model
binary outcome model, 475-6, 482-3
bivariate probit model, 522-3
ordered multinomial model, 519-20
Tobit model, 536
see also single-index model
indicator function, 298
indirect inference, 404-5
individual-specific effects model
additive, 780
binary outcome models, 795-6
cluster-specific effects, 830
count models, 802-3
definitions, 700, 780
duration models, 802
multiplicative, 780, 793
one-way, 700
parametric, 780
selection models, 801
single-index, 780
Tobit models, 800-1
two-way, 738
see also FE models; RE models 
induced test, 230 
information criteria, 278-9, 283-4 
Akaike, 278-9, 284, 624 
Bayesian, 278, 284 
consistent Akaike, 278 
Kullback-Leibler, 147, 169, 278, 280
Schwarz, 278, 284 
information matrix, 142 
block-diagonal, 144, 240, 329 
information matrix equality, 141-2, 145 
generalized, 142, 145 
see also BHHH estimate; OPG version 
information matrix (IM) test, 265-6 
bootstrap, 378 
computation, 261-2, 378 
definition, 265 
example, 270 
power, 267 
instrumental variables (IV) estimator 
alternative estimators, 190-2 
application, 110-2 
definition, 100-1 
example, 102-3 
finite-sample bias, 108-12, 191-2, 196 
identification, 100, 105-7 
independently-weighted IV estimator, 192 
jackknife IV estimator, 192 
LIML estimator, 191, 214 
in linear model, 98-112, 183-92, 211-2
linear IV as GMM estimator, 170, 186 
local average treatment effects estimator, 883-9 
in measurement error models, 908-10, 912-3 
in natural experiments, 54-5 
in nonlinear models, 192-9 
in panel models, 764-5, 757-61 
quantile regression, 190 
in selection models, 559 
split-sample estimator, 191-2 
systems IV estimator, 211-2, 218-9 
in treatment effects models, 883-9 
two-stage IV estimator, 102, 187 
two-stage least squares estimator, 101-2, 187-91 
Wald estimator, 98-9
see also GMM estimator; panel GMM estimator 
instruments 
definition, 96-7, 100 
examples, 97-8 
by exclusion restriction, 106 
by functional form restriction, 106 
invalid, 100, 105-7 
optimal, 180 
for panel data, 750-1, 754-6 
relevance, 108 
weak, 100, 104-12, 177-8, 191-2, 196, 751-2, 756 
see also instrumental variables estimator
integrated hazard function. See cumulative hazard
function
integrated mean squared error (IMSE), 303
integrated squared error (ISE), 302, 314
interval data models
definition, 532-3, 579
ML estimator, 534-5
interruption bias, 626
intraclass correlation, 816, 831, 835-8 
inverse-Gaussian (IG) distribution, 614-5, 677
inverse law of probability, 421 
inverse-Mills ratio, 540-1, 553-4 
inverse transformation method, 409, 412-3 
inverse-Wishart distribution, 443, 453, 514 
irrelevant regressors, 93 
ISE. See integrated squared error 
iterated bootstrap, 374 
iterative methods, 337-48
BFGS, 344
BHHH, 343-4
convergence criteria, 339-40
DFP, 344, 350-1
expectation maximization, 345-7, 623-5, 930-2
fast simulated annealing, 347-8
Gauss-Newton, 345, 348 
line search, 338 
Newton-Raphson, 338-9, 341-3, 348 
numerical derivatives, 340 
simulated annealing, 347 
starting values, 340, 351 
step size adjustment, 338 
IV estimator. See instrumental variables 


jackknife, 374-6 
bias estimate, 375 
bias-corrected estimator, 375 
example, 376 
IV estimator, 192 
standard error estimate, 375, 855 
Jensen’s inequality, 956 
jittered data, 290 
joint duration distributions, 648-55 
copulas, 651-5 
mixtures, 650-1 
multivariate hazard function, 649 
multivariate survivor function, 649-50 
joint limits, 767 
joint versus separate tests, 230-1, 285, 629-30 
just identification, 31, 100, 173 


Kaplan-Meier (KM) estimator, 581-3 
application, 575, 583, 604-5 
for baseline hazard, 596-7 
confidence bands for, 583 
definition, 581 
tied data, 582 

kernel density estimator, 298-306 
alternatives to, 306 


application, 296-7, 300 

asymptotic distribution, 301-2, 330-1 
bandwidth choice, 302-4 

bias, 301, 330-1 

confidence interval for, 305 
consistency, 300 

convergence rate, 302 

definition, 299 

derivative estimator, 305 

examples, 252-3, 367-8 

multivariate, 305-6 

Nadaraya-Watson kernel regression estimator, 312
optimal bandwidth, 303 

optimal kernel, 303 

variance, 301, 331 


kernel functions, 299-300 


comparison, 300 

definition, 299 

higher-order, 299, 306, 313 
leading examples, 300 

optimal for density estimation, 303 
properties, 299 


kernel matching, 875, 895-6 
kernel regression estimator, 311-9 


alternatives to, 319-22 

asymptotic distribution, 313, 331-3 
bandwidth choice, 314-6 

bias, 313, 331-2 

bootstrap confidence interval for, 380-1 
boundary problems, 309, 320-1 
conditional moment estimator, 317-8 
confidence interval for, 316 
consistency, 313 

convergence rate, 314 

definition, 312 

derivative estimator, 317 

introduction to nonparametric regression, 307-11 
multivariate, 318-9 

optimal bandwidth, 314 

optimal kernel, 314 

undersmoothing, 380 

variance, 301, 331 

see also nonparametric regression 


Khinchine’s theorem, 948 

KLIC. See Kullback-Leibler information criterion
KM estimator. See Kaplan-Meier 

k-NN estimator. See nearest neighbors estimator 
Kolmogorov LLN, 80, 111, 947 

Kolmogorov test, 267 

Kullback-Leibler information criterion (KLIC), 147,
169, 278, 280


LAD estimator. See least absolute deviations 
Lagrange multiplier (LM) test 
comparison with LR and Wald tests, 238-9
computation, 239-41, 256, 274 
definition, 234-5
examples, 236, 241-3 
for heteroskedasticity, 241, 267, 275 
in duration models, 632 
interpretation, 239-40 
for omitted variables, 274 
OPG version, 240-1 
for random effects, 737, 841 
score test, 234-5 
in Tobit model, 544 
for unobserved heterogeneity, 630, 636 
see also hypothesis tests 
Laplace approximation, 390 
Laplace distribution, 178, 541 
Laplace transform, 577 
LATE estimator. See local average treatment effects 
latent class model, 622 
see finite mixture models 
latent variable, 475, 532 
latent variable models 
additive random utility model, 476-8, 504-7 
binary outcomes, 475-8 
endogenous, 560-1 
ordered multinomial model, 519-20 
see also censored models; truncated models 
law of iterated expectations, 955 
law of large numbers (LLN), 947-8 
definition, 947 
examples of use, 80, 129 
Khinchine’s theorem, 948 
Kolmogorov LLN, 947 
Markov LLN, 948 
sampling schemes, 131, 948 
strong law, 947 
weak law, 947 
least absolute deviations (LAD) estimator 
application, 88-90
asymptotic distribution, 88 
binary outcome models, 484 
bootstrap, 381 
censored LAD, 564-5, 808 
definition, 87 
two-stage LAD, 190 
see also quantile regression 
least-squares dummy variable (LSDV) estimator, 704, 
732-3, 840 
least-squares dummy variable (LSDV) model, 704, 
732, 840 
least squares (LS) estimators 
clustered data, 833-7 
feasible generalized LS, 81-3, 155-8 
generalized LS, 81-5, 155-8 
linear, 70-85 
nonlinear LS, 150-9 
ordinary LS, 70-81 
panel data, 211, 702-3, 720-5 


systems of equations, 207-8, 211, 217 
see also FGLS; FGNLS; OLS; NLS 
leave-one-out estimate, 192, 304, 315, 375 
LEF. See linear exponential family 
length-biased sampling, 43-4, 626 
Liapounov CLT, 80, 131, 950 
likelihood-based hypothesis tests, 233-43 
comparisons of, 235-6, 238-9 
definitions, 234-5
examples, 236-7, 241-3 
see also LM tests; LR tests; Wald tests 
likelihood function, 139-41 
conditional likelihood function, 139, 731-2, 824 
definition, 139 
joint, 19, 824-7 
leading examples, 140-1 
marginal, 432, 595 
partial, 594-6 
likelihood principle, 139, 420, 433 
likelihood ratio (LR) test 
asymptotic distribution, 235, 237 
based on GMM-estimator, 245 
based on m-estimator, 244 
comparison with LM and Wald tests, 238-9 
definition, 234 
examples, 236, 241-3 
nonnested models, 279-83 
quasi-LR test statistic, 244 
uniformly most powerful test, 237 
see also hypothesis tests 
LIML estimator. See limited information maximum 
likelihood 
limit distribution, 948 
see also asymptotic distribution 
limit variance matrix, 952-3 
definition, 952 
replacement by consistent estimate, 952 
sandwich form, 953 
limited information maximum likelihood (LIML) 
estimator, 191, 214 
Lindeberg-Levy CLT, 80, 131, 950 
line search, 338 
linear exponential family (LEF) models, 147-9 
conjugate priors, 427-8 
conditional ML estimator, 782 
consistency, 148 
leading examples, 148 
pseudo-R², 288
residuals, 289-90 
tests based on, 240, 268, 274-5 
see also generalized linear models 
linear panel estimators, 695-778 
application, 708-15, 725 
Arellano-Bond estimator, 764-5
between estimator, 703 
covariance estimator, 733 
conditional ML estimator, 731-2 
differences-in-differences estimator, 768-70
error components 2SLS estimator, 760 
error components 3SLS estimator, 762 
first differences estimator, 704-5, 729-31
first differences IV estimator, 758 
fixed effects estimator, 704, 726-9 
fixed effects IV estimators, 757-9 


forward orthogonal deviations IV estimator, 759 


Hausman-Taylor IV estimator, 761 

LSDV estimator, 704, 732-3 

MD estimator, 753, 766-7

panel bootstrap, 377-8, 708, 746, 751
panel GMM estimators, 744-68 


panel-robust inference, 705-8, 722, 745-6, 751 


pooled OLS estimator, 702-3, 720-5 
random effects estimator, 705, 734-6 
random effects IV estimator, 759-60 
within estimator, 704, 726-9 
within IV estimator, 758 

linear panel models, 695-778 
analysis-of-covariance model, 733 
application, 708-15, 725 
between model, 702 
dynamic models, 763-8 
endogenous regressors, 744-63 
first differences model, 704, 730, 758 
fixed effects model, 700-2, 726-34, 757-9 
fixed versus random effects, 701-2, 715-9 
forward orthogonal deviations model, 759 
Hausman-Taylor model, 760-2 
incidental parameters problem, 704, 726 
individual dummies, 699 
individual-specific effects model, 700 
LSDV model, 704, 732 
minimum distance estimator, 753, 766-7 
mean-differenced model, 758 
measurement error, 739, 905 
mixed linear models, 774-6 
pooled model, 699, 720-5 
random effects differenced model, 760-1 
random effects model, 700-2, 734-6, 759-60 
residual analysis, 714-5 
strong exogeneity, 700, 749-50, 752 
time dummies, 699 
time-invariant regressors, 702, 749-51 
time-varying regressors, 702, 749-51 
two-way effects model, 738 
unbalanced data, 739 
weak exogeneity, 749, 752, 758 
within model, 704, 758 
see also linear panel estimators 

linear probability model, 466-7 

linear programming methods, 341 

linear regression model 
definition, 16-17, 70-1 

linear systems of equations, 207-14 
panel data models as, 211 
seemingly unrelated regressions, 209-10 


simultaneous equations, 22-31, 213-4 

systems FGLS estimator, 208 

systems GLS estimator, 208 

systems GMM estimator, 208 

systems ML estimator, 214 

systems OLS estimator, 211 

systems 2SLS estimator, 212 
linearization method, 855 
link function, 149, 469, 783 
listwise deletion, 60, 928 

consistency under MCAR, 928 

example, 936-8 

inconsistency under MAR only, 928 
Living Standards Measurement Study (LSMS), 59, 

88-90, 848-53 
LLN. See law of large numbers 
LM test. See Lagrange multiplier test 
local alternative hypotheses, 238, 247-8, 254 
local average treatment effects (LATE) estimator, 
883-9 

assumptions, 884-5

comparison with IV estimator, 885 

definition, 884 

heterogeneous treatment effect, 885 

monotonicity assumption, 885 

selection on unobservables, 883 

Wald estimator, 886 

see also ATE; ATET; MTE 
local linear regression estimator, 320-1, 333 
local polynomial regression estimator, 320-1 
local running average estimator, 308, 320 
local weighted average estimator, 307-8 
logistic distribution, 476-7 
logistic regression. See logit model 
logit model, 469-70 

application, 464-5 

as ARUM, 477, 486-7 

clustered data, 844 

definition, 469 

for discrete-time duration data, 602 

GLM, 149 

imputation example, 937-9 

index function model, 476 

marginal effects, 470 

measurement error example, 919 

ML estimator, 468-9 

multinomial logit, 494-5, 500-3, 525 

nested logit, 509-12, 526-7 

ordered logit, 520 

panel data, 795-9 

probit model comparison, 471-3 

random parameters logit, 512-6 

see also binary outcome models 
log-likelihood function. See likelihood function 

length-biased sampling, 43-4 
log-logistic distribution, 585-6, 592 
log-normal distribution, 585-6, 592 
log-normal model, 533, 545-6 
log-odds ratio, 470, 472
log-sum, 510 
log-Weibull distribution. See type 1 extreme value 
long panel, 723-5, 767 
longitudinal data. See panel data 
loss function, 66-69 
absolute error, 67 
asymmetric expected error, 67 
Bayesian decision analysis, 434-5 
expected, 66 
KLIC, 68, 147, 168, 278-9 
squared error, 67-9, 156 
step, 67-8 
Lowess regression estimator, 320-1 
application, 297, 309-10, 712-5 
LR test. See likelihood ratio test 
LS estimators. See least squares 
LSDV. See least-squares dummy variable 
LSMS. See Living Standards Measurement Study 


MAR. See missing at random 
marginal analysis of panel data, 717, 787 
marginal effects, 122-4 
in binary outcome models, 466-5, 467, 470-1 
calculus method, 123 
computing, 122-4
definition, 122 
example, 162-3 
finite-difference method, 123 
in fixed effects model, 702, 788 
in multinomial models, 493-4, 501-3, 519-23, 525 
population-weighted, 821 
in sample selection models, 552 
in single-index models, 123 
in Tobit model, 541-2 
see also coefficient interpretation 
marginal likelihood, 432, 595 
marginal treatment effects (MTE) estimator, 886 
market-level data, 482, 513 
Markov chain Monte Carlo (MCMC) methods, 
445-54 
convergence, 449, 458 
in data augmentation, 933 
examples, 452-4, 512, 687, 936-9 
Gibbs sampler, 448-50, 514, 519, 563 
Metropolis algorithm, 450-1 
Metropolis-Hastings algorithm, 451-2, 512 
Markov LLN, 77, 131, 948 
Marshall-Olkin method, 649-51, 686 
matching assumption, 864 
see also overlap assumption 
matching estimators, 871-8, 889-96 
application, 889-96 
assumptions, 863-5 
ATE matching estimator, 877 
ATET matching estimator, 874, 877, 894-6 
balancing condition, 893 
caliper matching, 874 


counterfactuals, 871 

exact matching, 872, 891 

inexact matching, 873 

interval matching, 875-6 

kernel matching, 875, 895-6 

nearest-neighbor matching, 875, 894-6 

propensity score matching, 873-8, 892 

radius matching, 876, 895-6 

selection on observables only, 871 

stratification matching, 875-6, 893-6 

variance computation, 877-8, 895 
maximum empirical likelihood (MEL) estimator, 206 
maximum likelihood (ML) estimator, 139-46 

asymptotic distribution, 142-3 

conditional ML estimator, 731-2, 782-3, 796-9

consistency, 142, 824 

definition, 141 

endogenous stratification, 824-7 

example, 143-4 

exogenous stratification, 824 

MSL estimator, 393-8 

quasi-ML estimator, 146-50 

regularity conditions, 141, 145-6 

restricted, 233 

unrestricted, 233 

variance matrix estimation, 144 

weighted ML estimator, 828 

see also quasi-ML estimator 
maximum rank correlation estimator, 485 
maximum score estimator, 341, 381, 483-4, 800 
maximum simulated likelihood (MSL) estimator, 

393-8 

asymptotic distribution, 394-5 

bias-adjusted MSL, 396-7 

compared to MSM, 402-3 

count model examples, 677-8, 687, 689 

definition, 394 

example, 397-8 

multinomial probit model, 518 

number of simulations, 396 

random parameters logit model, 522 
MCAR. See missing completely at random 
MD estimator. See minimum distance estimator 
mean-differenced estimator, 783, 805-6 
mean-differenced model, 758, 783 
mean imputation, 928, 936-8 
mean integrated squared error (MISE), 303, 314 
mean-scaling estimator, 783, 805-6 
mean-square convergence, 946 
mean substitution. See mean imputation 
measurement error 

in cohort-level data, 772-3 

in dependent variable, 913-4 

in microdata, 46, 60 

in panel data, 739, 905 

in regressors, 899-922 

see also measurement error model estimators; 

measurement error models 


measurement error model estimators, 899-922
attenuation bias, 903-5, 911, 915, 919-20
bounds identification, 906-8
corrected score estimator, 916-8
IV estimator, 908-10, 912-3
linear models, 900-11
nonlinear models, 911-20
OLS estimator inconsistency, 902-4
using additional moment restrictions, 909-10
using instruments, 908-9
using known measurement error variance, 902-3,
910
using replicated data, 910-1, 913
using validation sample, 911
measurement error models, 899-922
attenuation bias, 903-5, 911, 915, 919-20
classical measurement error model, 901-2
dependent variable measured with error, 913-4
examples, 919-20
identification, 905-14
linear models, 900-11
multiple regressors, 904
nonclassical measurement error, 904, 920
nonlinear models, 911-20
panel models, 905
scalar regressor, 903
serial correlation, 909
variance inflation, 904, 916
see also measurement error model estimators
median regression. See LAD estimator
MEL. See maximum empirical likelihood
m-estimator, 118-22
asymptotic distribution, 120
clustered data, 842-3
definition, 118-9
sequential two-step, 200-2
simulated m-estimator, 398-9
tests based on, 244, 263-4
weighted m-estimator, 829, 856
see also extremum estimators
method of moments (MM) estimator
asymptotic distribution, 134, 174
definition, 172
examples, 167
see also estimating equations estimator; GMM
estimator
method of scoring, 343, 348
method of simulated moments (MSM) estimator,
399-404
asymptotic distribution, 400-2
compared to MSL, 402-3
definition, 400
example, 403
MNP model, 497, 518
number of simulations, 399
method of simulated scores (MSS) estimator
for MNP model, 519
method of steepest ascent, 344
Metropolis algorithm, 450-1
Metropolis-Hastings algorithm, 451-2, 512
microdata sets, 58-61
handling, 59-61
leading examples, 58-9
microeconometrics overview, 1-17
midpoint rule, 388, 391-2
minimum chi-square estimator, 203
see also Berkson’s minimum chi-square estimator
minimum distance (MD) estimator, 202-3, 753, 766-7
asymptotic distribution, 202
bootstrap for, 379-80
covariance structures, 766-7
definition, 202
equally-weighted, 202
generalized, 222
indirect inference, 404-5
OIR test, 203
optimal, 202, 753
panel data, 753, 766-7
relation to GMM, 203, 753
misclassification, 914 
MISE. See mean integrated squared error 
missing at random (MAR), 926-7 
definition, 926 
and ignorable missingness, 927, 932 
relation to MCAR, 927 
missing completely at random (MCAR), 
926-7 
definition, 927 
and ignorable missingness, 927 
relation to MAR, 927
missing data, 923-41 
deletion methods, 928 
examples, 924 
ignorable assumption, 927 
imputation with models, 929-41 
imputation without models, 928-9 
MAR assumption, 926-7 
MCAR assumption, 927 
nonignorable missingness, 927, 940 
see also imputation methods 
misspecification tests. See specification tests 
mixed estimator, 439-41 
mixed linear model, 774-6 
Bayesian methods, 775 
FGLS estimator, 775 
fixed parameters, 774 
ML estimator, 776 
random parameters, 774 
restricted ML estimator, 776 
nonstationary panel data, 767-8 
prediction, 776 
see also hierarchical linear model 
mixed logit model, 500-3 
example, 495 
definition, 500 
see also RPL model 
mixed proportional hazards (MPH) model,
611-25 
Weibull-gamma mixture, 615 
see also mixture models 
mixture hazard function, 616-8 
mixture models, 611-25 
application, 623-6 
counts, 675-9 
durations, 611-25 
identification, 618-20 
MSL estimator, 393-8, 687 
multinomial outcomes, 515-6 
multiplicative heterogeneity, 613 
specification tests, 628-32 
see also finite mixture models; unobserved 
heterogeneity 
ML estimator. See maximum likelihood 
MM estimator. See method of moments 
MNL estimator. See multinomial logit 
MNP estimator. See multinomial probit 
model diagnostics, 287-91 
binary outcome models, 473-4 
duration models, 628-32 
example, 290-1 
multinomial outcome models, 499 
pseudo-R² measures, 287-9, 291
residual analysis, 289-91 
see also model selection methods 
model misspecification, 90-4
see also endogeneity; functional form 
misspecification; heterogeneity; omitted variables;
pseudo-true value 
model selection methods 
Bayesian, 456-8 
nested models, 278-81 
nonnested models, 278-84 
order of testing, 285 
see also model diagnostics; specification tests 
moment-based simulation estimators, 
398-404 
see MSL estimator; MSM estimator 
moment-based tests. See m-tests 
moment matching. See indirect inference 
Monte Carlo integration, 391-2 
direct, 391 
example, 392 
importance sampling, 407, 443-5 
simulators, 393-4, 406-10 
see also quadrature 
Monte Carlo studies, 250-4 
example, 251-4 
moving average estimator, 308 
moving blocks bootstrap, 373, 381 
MPH model. See mixed proportional hazards 
MSL estimator. See maximum simulated likelihood 
MSM estimator. See method of simulated moments 
MSS estimator. See method of simulated scores 
MTE. See marginal treatment effects 


m-tests, 260-71 

asymptotic distribution, 260, 263 

auxiliary regressions, 261-3 

bootstrap, 261, 379 

chi-square goodness of fit, 266-7, 270-1, 

474 

conditional moment test, 264-5, 267-9, 319

CM test interpretation, 268 

computation, 261-3 

definition, 260 

Hausman test, 271-4, 717-9 

information matrix tests, 265-6, 270 

outer-product-of-the-gradient form, 262 

overidentifying restrictions test, 181, 183, 267, 

747 

power, 268 

rank, 261 
multicollinearity, 350-1 

in multinomial probit model, 517 

in panel model, 752 

in sample selection model, 542, 551 
multilevel models. See hierarchical models 
multinomial logit (MNL) model, 500-3, 525 

application, 494-5 

as additive random utility model, 505 

definition, 500 

marginal effects, 494, 501-3, 525 

ML estimator, 501 

panel data, 798 

see also multinomial outcome models 
multinomial outcome models, 490-528 

application, 491-5 

alternative-invariant regressors, 498 

alternative-varying regressors, 497 

conditional logit, 500-3, 524-5 

definition, 496-7 

identification, 504 

index function model, 519-20 

marginal effects, 501-3, 524-5 

mixed logit, 500-3 

ML estimator, 496, 501 

multinomial logit, 500-3, 525 

multinomial probit, 516-9 

ordered models, 519-20 

OLS estimator, 471 

panel data, 798 

random parameters logit, 512-6 

random utility model, 504-7 

semiparametric estimation, 523-4 
multinomial probit (MNP) model, 516-9 

Bayesian methods, 519

definition, 516-7 

identification, 517 

ML estimator, 518 

MSL estimator, 518 

MSM estimator, 518 

MSS estimator, 518 

see also multinomial outcome models 
multiple duration spells, 655-8
fixed effects, 656 
lagged duration dependence, 657 
ML estimator, 658 
random effects, 657 
recurrent spells, 655 
multiple imputation, 934-9 
estimator, 934 
examples, 935-9 
relative efficiency, 935 
variance of estimator, 934-5
multiple treatments, 860 
multiplicative errors 
multistage surveys, 41-2, 814-6, 853-6 
variance estimation, 853 
multivariate data 
binary outcomes, 521-3 
counts, 685-7 
durations, 640-64 
see also systems of equations 
multivariate-t distribution, 442 


NA estimator. See Nelson-Aalen 

National Longitudinal Survey (NLS), 58, 110-2 

National Longitudinal Survey of Youth (NLSY), 
58-9 

National Supported Work (NSW) demonstration 
project, 889-95 

natural conjugate pair, 427-8 

natural experiments, 32, 54-8 

definition, 54 


differences-in-differences estimator, 55-7, 768-70,
878-9

examples, 54 

exogenous variation, 54-5

identification, 57-8 

instrumental variables, 54-5

regression discontinuity design, 879-83 
ncp. See noncentrality parameter 
nearest neighbors (k-NN) estimator, 319-20 

definition, 319 

example, 308-9 

symmetrized, 308, 320 

see also nonparametric regression 
nearest-neighbor matching, 875, 894-6 
negative binomial distribution, 675 
negative binomial model, 675-7 

application, 690 

bivariate, 215, 686-7 

hurdle model, 681 

ML estimator, 677 

MSL estimator, 677-8 

NB1 variant, 676 

NB2 variant, 676 

panel data, 804, 806 
negative hypergeometric distribution, 806 
neglected heterogeneity. See unobserved 

heterogeneity 


Nelson-Aalen (NA) estimator, 582-4 

application, 605-6, 662 

confidence bands for, 584 

definition, 582 

tied data, 582 
nested bootstrap, 374, 379 
nested logit model, 507-12, 526-7 

from ARUM, 526-7 

definition, 510-1

different versions of, 511-2 

example, 511 

GEV model, 508, 526 

ML estimator, 510 

sequential estimator, 510 

welfare analysis, 510 

see also multinomial models 
nested models, 278, 281

see also nonnested models 
neural network models, 322 
Newey-West robust standard errors, 137, 175, 

723 

definition, 175 

see also robust standard errors 
Newton-Raphson (NR) method, 341-3 
examples, 338-9, 348 
NLFIML estimator. See nonlinear full-information 
maximum likelihood 
NLS estimator. See nonlinear least squares 
NLSY. See National Longitudinal Survey of Youth 
NL2SLS estimator. See nonlinear two-stage least 
squares 
NL3SLS estimator. See nonlinear three-stage least 

squares 

noise-to-signal ratio, 903 
noncentral chi-square distribution, 248 
noncentrality parameter (ncp), 248 
nonclassical measurement error, 904, 920 
nongradient methods, 337, 341, 347-8 
nonignorable missingness, 927, 940 

attrition bias due to, 940 

selection bias due to, 927, 932, 940 
nonlinear estimators 


coefficient interpretation, 122-4 

extremum estimator 

m-estimator, 118-22 

GMM estimator, 166-222 

ML estimator, 139-46 

NLS estimator, 150-9 

overview, 117-22

panel models, 779-810 
nonlinear full-information maximum likelihood 

(NLFIML) estimator, 219 

nonlinear GMM estimator, 192-9 

asymptotic distribution, 194-5 

definition, 194-5

example, 197-8, 199, 688 

instrument choice, 196 

NL2SLS estimator, 196 
optimal, 195
panel data, 789-90 
nonlinear in parameters, 27 
nonlinear in variables, 27 
nonlinear IV estimator. See nonlinear GMM 
nonlinear least squares (NLS) estimator, 150-9 
asymptotic distribution, 152-4 
consistency, 152-3 
definition, 151 
example, 155, 159-64 
time series, 158-9 
variance matrix estimation, 154-5
nonlinear panel estimators, 779-810
application, 792-5 
conditional ML estimator, 781-2, 805 
dummy variable estimator, 784-5, 800, 805 
first-differences estimator, 783-4 
fixed effects estimator, 783-5, 794, 796-802, 805-8 
GEE estimator, 790, 794, 804 
mean-differenced estimator, 783, 805-6 
mean-scaling estimator, 783, 805-6 
ML estimator, 785-6 
NLS estimator, 787, 794 
panel GMM estimator, 789-90 
panel-robust inference, 788-91 
quadrature, 785-6, 796, 800 
quasi-differenced estimator, 783-4 
quasi-ML estimator, 791 
random effects estimator, 785-6, 794-6, 800-1,
803-4 
selection models, 801 
semiparametric, 808 
nonlinear panel models, 779-810 
application, 792-5 
binary outcome models, 795-6 
conditional mean models, 780-1 
count models, 792-5, 802-6 
dynamic models, 791-2, 797-9, 806-7 
endogenous regressors, 792 
exogeneity assumptions, 781 
finite mixture models, 786 
fixed effects models, 781-5, 791-2
fixed versus random effects, 788 
incidental parameters problem, 781-2, 805 
individual-specific effects models, 780-1 
parametric models, 780, 782-3, 785-7, 792 
pooled models, 787, 794 
random effects models, 785-6, 792
selection models, 801 
semiparametric, 808 
Tobit models, 800-1 
transition models, 801-2 
nonlinear regression model, 151 
additive error, 168, 193, 217 
nonadditive error, 168, 193, 218 
nonlinear systems of equations, 214-9 
additive errors, 217 
copulas, 651-5 


mixtures, 650-1 
ML estimator, 215-6 
NLFIML estimator, 219 
NL3SLS estimator, 219 
nonadditive errors, 217-8 
nonlinear panel model, 216 
nonlinear SUR model, 216 
quasi-ML estimator, 150 
seemingly unrelated regressions, 216 
simultaneous equations, 219 
systems FGNLS estimator, 217 
systems GMM estimator, 219 
systems IV estimator, 218-9 
systems MM estimator, 218 
systems NLS estimator, 217 
nonlinear three-stage least squares (NL3SLS) 
estimator, 219 
nonlinear two-stage least squares (NL2SLS) estimator 
asymptotic distribution, 195-6 
definition, 195-6 
example, 199 
see also nonlinear GMM estimator 
nonnested models 
Cox LR test, 279-80 
definition, 278 
example, 283-4 
information criteria comparison, 278-9 
overlapping, 281 
strictly nonnested, 281 
Vuong LR test, 280-3 
nonparametric bootstrap. See paired bootstrap 
nonparametric density estimation. See kernel density 
estimator 
nonparametric maximum likelihood (NPML) 
estimator, 622 
nonparametric regression, 307-22 
convergence rate, 311, 314 
kernel, 311-9 
local linear, 320 
local weighted average, 307-8 
Lowess, 320 
nearest-neighbors, 308-9, 319-20 
series, 321 
statistical inference intuition, 309-11 
test against parametric model, 319 
see also semiparametric regression 
nonrandomly varying coefficient, 846 
normal copula, 654 
normal distribution, 140 
truncated moments, 540, 566-7 
normal limit product rule. See Cramer linear 
transformation 
NPML estimator. See nonparametric maximum 
likelihood 
NR method. See Newton-Raphson method 
NSW demonstration project. See National Supported 
Work 
nuisance parameters. See incidental parameters 
numerical derivatives, 340, 350
numerical integration. See quadrature


observational data, 40-8, 814-7
biased samples, 42-5
clustering, 42
identification strategies, 36-7
measurement error, 46
missing data, 46
population, 40
sample attrition, 47
sampling methods, 40-4, 815-7
sampling units, 41, 815
sampling without replacement, 816-7
survey methods, 41-2, 814-7
survey nonresponse, 45-6
types of data, 47-8
observational equivalence, 29
odds ratio, 470
see also posterior odds ratio
OIR test. See overidentifying restrictions test
OLS estimator. See ordinary least squares
omitted variables bias, 92-3, 700, 716
LM tests for, 274
one-step GMM estimator, 187, 196
panel, 746, 755
see also two-stage least squares
one-way individual-specific effects model. See
individual-specific effects model
on-site sampling, 43, 823
optimal Bayesian estimator, 434
optimal GMM estimator, 176, 179-81, 187, 195
compared to 2SLS, 187-8
optimal MD estimator, 202, 753
OPG. See outer-product of the gradient
Orbit model, 914
order of magnitude, 954
ordered logit model, 520, 682
ordered multinomial models, 519-20
ordered probit model, 520, 535
ordinary least squares (OLS) estimator, 70-81
asymptotic distribution, 73-4, 80-1
bias in standard errors with clustering, 836-7
binary data, 471
clustered data, 833-7
coefficient interpretation in misspecified model,
91-2
consistency, 72, 80
definition, 71
example, 84-5
finite-sample distribution, 79
heteroskedasticity-robust standard errors, 74-5, 81
identification, 71-2
inconsistency, 91, 95-6
inefficiency, 80
nonlinear, 150-9
panel data, 702-3, 720-5
see also least squares estimators
orthogonal polynomials, 321, 329, 390
definition, 390
orthogonal regression approach, 920
orthonormal polynomials, 321, 329, 390
outcome equation, 547, 867
outer product (OP) estimate, 138, 241, 395
outer-product of the gradient (OPG) version
LM test, 240-1
m-test, 262-4
small-sample performance, 262
overdispersion, 670-1, 674-6, 690
measurement error, 915-6
panel data, 794, 806
tests for, 671
overidentification, 31, 100, 173, 176, 379-80, 747
see also GMM estimator
overidentifying restrictions (OIR) test
asymptotic distribution, 181, 183
bootstrap, 379-80
definition, 181, 267, 277
panel data, 747, 756
overlap assumption, 864, 871
in RD design, 881
oversampling, 41, 478-9, 814, 872


paired bootstrap, 360, 366-8, 376, 378
pairwise deletion, 928
biased standard errors, 928
panel attrition, 739, 801
panel bootstrap, 377, 707, 746, 751, 789
panel data, 47
panel data models and estimators, 695-810
comparison to clustered data, 831-2
see also linear panel; nonlinear panel
panel GMM estimators, 744-68, 789-90
application, 754-6
Arellano-Bond estimator, 765-6
asymptotic distribution, 745-6
bootstrap, 389-90
compared to MD estimator, 753
computation, 751-2
definition, 745
efficiency, 747, 756
exogeneity assumptions, 748-52
instruments, 744, 747-51
IV estimators for FE model, 757-9
IV estimators for RE model, 759-60
just-identified, 745
nonlinear, 789-90
OIR test, 747, 756
one-step GMM estimator, 746, 755
overidentified, 745
2SLS estimator, 746, 755
two-step GMM estimator, 746, 755
variance matrix estimation, 751
panel GMM model, 744-66
application, 754-6
dynamic, 763-6
with individual-specific effects, 750-62
without individual-specific effects, 744-53 
see also panel GMM estimators 
panel IV estimators. See panel GMM estimators 
panel-robust statistical inference, 377, 705-7, 722, 
746, 751, 788-90 
for Hausman test, 718 
Panel Study of Income Dynamics (PSID), 58, 889
parametric bootstrap, 360 
Pareto distribution 
of the first kind, 609 
of the second kind, 616 
partial additive model, 323 
partial equilibrium analysis, 53, 862, 972 
see also SUTVA 
partial F-statistic, 105, 109, 111 
partial likelihood estimator, 594-6 
partial ML estimator, 140 
partial R-squared, 104-5, 111 
partially linear model, 323-5, 327, 565, 684 
participation equation, 547, 551 
Pearson chi-square goodness-of-fit test, 266 
Pearson residual, 289, 291 
peer-effects model, 832 
percentile, 86 
percentile method, 364-5, 367-8 
percentile-t method, 364, 366-7 
PH model. See proportional hazards 
piecewise constant hazard model, 591 
Pitman drift, 248 
PML estimator. See pseudo-ML estimator 
Poisson distribution, 668 
Poisson-gamma mixture, 675 
Poisson-IG mixture, 677 
Poisson regression model, 666-74 
application, 671-4, 690, 792-5, 850-3 
asymptotic distribution of estimators, 668-9 
bivariate, 686 
censored MLE, 535 
with clustered data, 844, 850-3 
coefficient interpretation, 669 
definition, 668 
equidispersion, 668 
example, 117-8, 121-2 
LEF density, 148 
measurement error, 915-8 
mixtures, 675-9 
ML estimator, 668 
overdispersion, 670-1 
panel data, 792-5, 802-6 
quasi-ML estimator, 668-9, 682-3 
truncated MLE, 535 
underdispersion, 671 
zero-truncated, 680 
see also count models 
polynomial baseline hazard, 591, 636 


pooled cross-section time series model. See pooled
model


pooled estimators, 702-3, 720-5 
application, 710-2, 725 
FGLS estimator, 720-1 
GEE estimator, 790, 794 
NLS estimator, 794 
OLS estimator, 211, 702-3, 720-5 
WLS estimator, 702-3, 721 
pooled model, 699, 720-5, 787-8 
pooling tests, 737 
population-averaged model. See pooled model 
population moment conditions 
for estimation, 172 
for testing, 260 
see also GMM estimator; MM estimator; m-tests 
posterior distribution, 421, 430-4 
asymptotic behavior, 432-4 
conditional posterior, 431 
definition, 421 
expected posterior loss, 434 
expected posterior risk, 434 
full conditional distribution, 431 
highest posterior density interval, 431 
highest posterior density region, 431 
marginal posterior, 430 
observed-data posterior, 930 
posterior density interval, 431 
posterior mean, 423, 434 
posterior mode, 433 
posterior moments, 430 
posterior precision, 423 
see also Bayesian methods 
posterior odds ratio, 456 
posterior (P) step, 455, 933 
potential outcome model, 30-4, 861-5 
see also treatment effects; treatment evaluation 
power of tests, 247-50, 253-4 
bootstrapped tests, 372-3 
conditional moment test, 267-9 
example, 253-4 
Hausman test, 273-4 
local alternative hypotheses, 247-8 
uniformly most powerful test, 237 
Wald tests, 248-50 
precision parameter, 423 
predetermined instruments. See weak exogeneity 
prediction, 66-70 
best linear, 70 
conditional, 66 
error, 66-70 
in linear panel models, 738 
in mixed linear model, 774-6 
optimal, 66-70 
rotation groups, 814 
in structural model, 28 
weighted, 821 
pretest estimator, 285 
primary sampling units (PSUs), 41, 815, 
845-55
prior distribution, 425-30
conjugate prior, 427 
definition, 420 
Dickey’s prior, 439 
diffuse prior, 426 
flat prior, 426 
hierarchical priors, 428-9, 441-2 
improper prior, 426 
informative prior, 437-9 
Jeffreys’ prior, 426 
noninformative prior, 425, 435-7 
normal-gamma prior, 437 
sensitivity analysis for, 429-30 
see also Bayesian methods 
probit model, 470-71 
application, 465-6 
as additive random utility model, 477 
bivariate probit, 522-3 
bootstrap example, 254-6 
definition, 470 
discrete-time duration data, 602 
as GLM, 149 
index function model, 476 
logit model comparison, 471-3 
marginal effects, 467, 471 
ML estimator, 470 
Monte Carlo study example, 251-4 
multinomial probit, 516-9 
ordered probit, 520, 535 
panel data, 795-6 
simultaneous equations probit, 523, 560-1 
see also binary outcome models 
probit selection equation, 548 
product copula, 654 
product integral, 578 
product rule, 949 
see also Cramer linear transformation 
program evaluation. See treatment evaluation 
projection pursuit model, 323 
propensity score, 864-5 
application, 893-4 
balancing condition, 864, 893-4 
conditional independence assumption, 865 
definition, 864 
matching, 873-8, 892 
see also treatment evaluation 
proportional hazards (PH) model, 592-7 
application, 605-7 
baseline survivor function estimator, 596-7 
coefficient interpretation, 606-7 
competing risks model, 645-6 
definition, 591 
discrete-time model, 600-3 
leading examples, 585 
mixed PH, 611-25 
panel data, 802 
partial likelihood estimator, 594-6
pseudo-ML estimator (PML). See quasi-ML estimator
pseudo panels, 771-3 
cohort, 771 
cohort fixed effects, 772-3 
measurement error, 772-3 
pseudo-random number generators, 410-6, 957-9 
accept-reject methods, 413-4 
composition methods, 415 
inverse transformation method, 413 
leading distributions, 957-9 
multivariate normal, 416 
transformation method, 413 
uniform variates, 412 
see also MCMC methods 
pseudo R-squared measures 
for binary outcome models, 473-4 
definitions, 287-9 
example, 290-1 
for multinomial outcome models, 499 
pseudo-true value, 94, 132, 146, 281 
PSID. See Panel Study of Income Dynamics
PSUs. See primary sampling units 
pure exogenous sampling, 825 
p-value, 226, 229, 234, 286, 363 


quadrature, 388-90
Gaussian, 389-90
multidimensional, 393
in nonlinear panel models, 785-6, 796, 800
see also Monte Carlo integration
qualitative response models. See binary outcomes,
multinomial outcomes
quantile, 86-7
quantile regression, 85-90
application, 88-90
asymmetric absolute loss, 68, 85
asymptotic distribution, 88
bootstrap, 381
computation, 341
definition, 87
IV estimator, 190
multiplicative heteroskedasticity, 86-7
quasi-difference, 783-4 
quasi-experiment. See natural experiment 
quasi-maximum likelihood (QML) estimator, 146-50
asymptotic distribution, 146
in binary outcome models, 469
in clustered models, 842-3
definition, 146
in LEF, 147-9
with multivariate dependent variable, 150
in nonlinear systems, 216
in panel models, 768, 786
in Poisson model, 668-9, 682-3
quasi-random numbers. See pseudo-random numbers 
QML estimator. See quasi-ML estimator 


random assignment, 49-50, 862 
see also sampling schemes
random coefficients model, 94, 385, 774-6, 786
see also hierarchical models 
random effects (RE) estimator, 705, 734-6, 759-62, 
785-6 
application, 710-1, 725 
asymptotic distribution, 735 
clustered data, 837-9, 843-4 
consistency, 699, 764 
definition, 705, 734 
error components 2SLS estimator, 760 
error components 3SLS estimator, 762 
FGLS estimator, 734-6 
GEE estimator, 790, 794, 804 
Hausman test, 717-9 
incidental parameters, 704, 726 
IV estimators, 759-60 
ML estimator, 736, 785-6, 794-7, 800-1, 803-4
NLS estimator, 787, 794 
quasi-ML estimator, 791 
two-way effects model, 738 
versus fixed effects, 701-2, 715-9
random effects (RE) model, 700-2, 734-6, 759-62, 
785-6 
binary outcome models, 795-6 
Chamberlain model, 719, 786 
clustered data, 831, 843-4 
count models, 794, 803-4 
definition, 700, 734 
dynamic models, 792 
duration models, 801-2 
endogenous regressors, 756-7, 759-62 
Mundlak model, 719 
nonlinear models, 785-6 
selection models, 801 
Tobit model, 800-1 
two-way effects model, 738 
versus fixed effects, 701-2, 715-9
see also hierarchical models; random effects estimator
random number generators. See pseudo-random 
numbers 
random parameters logit (RPL) model, 512-6 
Bayesian methods, 514 
definition, 513 
ML estimator, 513-4 
random parameters model. See random coefficients 
model 
random utility models. See ARUM 
randomization bias, 53, 867 
randomized experiment, 50-3 
National Supported Work demonstration project, 
889 
randomized trials, 49-53 
randomly varying coefficient, 847-8 
rank condition for identification, 31, 182, 214 
rank-ordered logit model, 521 
rank-ordered probit model, 521 
raw residual, 289, 291
RD design. See regression discontinuity design
receiver operating characteristics (ROC) curve, 474
reduced form, 21, 25, 213 
see also structural model 
RE estimator. See random effects 
regression-based imputation, 930-2 
EM algorithm, 932 
nonignorable missingness, 932 
regression discontinuity (RD) design, 879-83 
fuzzy RD design, 882 
heterogeneous treatment effects, 882 
RD estimator, 882-3 
sharp RD design, 880-1 
treatment assignment mechanism, 879-81 
regressors, 71 
alternative-varying, 478, 497-8 
endogenous, 23-33 
fixed, 76-7 
irrelevant, 93 
omitted, 92-3 
stochastic, 77 
time-varying, 597-600, 702, 749-51
see also endogenous regressors 
regularity conditions for ML, 141-2, 151-6 
relative risk, 470, 503 
reliability ratio, 903 
renewal function, 626 
renewal process, 626, 638 
repeated cross section data, 47, 770-3 
see also differences-in-differences 
repeated measures. See panel data 
replicated data, 910-1, 913 
RESET test, 277-8 
residual analysis 
definitions, 289-90 
duration data, 633-6 
example, 290-1 
panel data, 714-5 
small-sample correction, 289 
residual bootstrap, 361 
response-based sampling, 43 
restricted ML estimator, 233, 776 
revealed preference data, 498, 516 
ridge regression estimator, 440 
Robinson difference estimator, 324-5, 565
robust sandwich variance matrix estimate. See 
sandwich variance matrix 
robust standard errors 
bootstrap, 362-3, 376-8 
Eicker-White, 74-5, 80-1, 112, 137 
for extremum estimator, 137-9 
Huber-White, 137, 144, 146 
Newey-West, 137, 175, 723 
see also cluster-robust; heteroskedasticity-robust; 
panel-robust; systems-robust 
ROC curve. See receiver operating characteristics
curve
rotating panels, 739
Roy model, 555-7, 562
definition, 556 
dummy endogenous variable, 557 
Heckman two-step estimator, 556 
ML estimator, 556 
panel semiparametric estimation, 808 
as treatment effects model, 867 
RPL model. See random parameters logit 
R-squared, 287 
pseudo, 287-9 
uncentered, 241, 263 
running mean estimator, 308 


SA method. See simulated annealing 
sample attrition, 47 
sample moment conditions 
see population moment conditions 
sample selection bias, 44-5 
sample weights, 817-21, 853-6 
see also weighting 
sampling schemes 
assumptions for OLS, 76-8
case-control, 479, 823 
choice-based sampling, 43, 478-9, 823 
endogenous sampling, 42-5, 78, 822-9, 856 
endogenous stratified sampling, 78, 820, 825-6, 
856
exogenous stratified sampling, 42, 78, 814-5, 820,
825, 856
fixed in repeated samples, 76-7 
flow sampling, 44, 626 
multi-stage surveys, 41-2, 814-6, 853-6 
on-site sampling, 43, 823 
simple random sampling, 41, 76-7, 816 
stock sampling, 44, 626-7 
with replacement, 816 
without replacement, 816-7 
sandwich variance matrix 
clustered data, 834, 842 
extremum estimator, 132, 137-9 
GMM estimator, 175 
ML estimator, 144, 148 
NLS estimator, 150 
OLS estimator, 74 
panel data, 705-7, 722, 746, 751 
for Wald test, 277 
see also robust standard errors 
Sargan test, 277 
see also overidentifying restrictions test 
scale parameter, 509 
scanner data, 499 
Schwarz criterion. See BIC 
SCLS estimator. See symmetrically censored least 
squares 
score test. See Lagrange multiplier test
score vector, 141 
secondary sampling units (SSUs), 41, 815, 854 
seed, 411
seemingly unrelated regressions (SUR) model,
209-10, 216
Bayesian MCMC example, 452-4 
count data, 685 
error components, 762 
nonlinear, 216 
selection bias, 44-5
nonignorable missingness, 927, 932, 940 
treatment effects models, 867-71
see also selection models 
selection models, 546-62 
bivariate sample selection model, 547-53 
count models, 680 
example, 553-5 
panel data, 801 
Roy model, 555-7, 867 
sample selection, 546 
self selection, 546 
semiparametric estimation, 565-6 
structural models, 558-62 
treatment effects model, 862-4
versus selection on observables only, 552-3, 864, 
868-71 
versus two-part models, 546, 552-3 
see also Tobit models 
selection on observables only, 552-3, 862-4, 868-9,
878-83, 889-96
compared to selection models, 552-3, 864, 871 
conditional independence assumption, 868 
control function estimator, 869 
definition, 868-9 
DID estimator, 878-9 
RD design estimator, 879-83 
treatment effects model, 862-4, 889-96
selection on unobservables, 552-3, 865-71, 883-9 
definition, 868 
in treatment effects model, 862-4
IV estimators, 883-9 
Roy model, 867 
selection bias, 867-71 
selection model, 552-3 
self-weighting sample, 818 
SEM. See simultaneous equations model 
seminonparametric ML estimator, 328-9, 485 
semiparametric efficiency bounds, 323, 329-30, 485 
semiparametric estimators, 322-30 
adaptive, 323 
application, 565 
average derivative estimator, 326 
efficiency bounds, 323, 329-30 
nonparametric FGLS, 328 
Robinson difference estimator, 324-5, 565
semiparametric least squares, 327, 483 
seminonparametric ML estimator, 328-9, 485 
see also semiparametric models 
semiparametric heterogeneity model, 622 
see also finite mixture models 
semiparametric least squares, 327, 483
semiparametric ML estimator, 328-9, 485 
semiparametric models, 322-30 
additive models, 327 
binary outcome models, 482-6 
censored models, 563-5 
count models, 684-5
definition, 322 
duration models, 594-600, 601-2
flexible parametric models, 563 
heteroskedastic linear model, 323, 328 
identification, 325-6 
leading examples, 322 
multinomial outcome models, 523-4 
panel data models, 808 
partially linear model, 324-5 
selection models, 565-6 
single-index models, 325-7 
see also semiparametric estimators 
sequential limits, 767 
sequential multinomial models, 520-1 
sequential two-step m-estimator, 200-2 
bootstrap for, 362 
sequence of random variables, 943, 945 
serial correlation. See autocorrelation 
set identification, 29 
series estimator, 321 
for binary outcomes, 483 
shared frailty model, 662 
short panel 
definition, 700 
statistical inference in, 705-8, 721-2, 746, 751, 768 
shrinkage estimator, 440 
Silverman’s plug-in estimate, 304 
simple random sampling (SRS), 41, 76-7, 816 
simple stratified sampling, 818 
Simpson’s rule, 388-9 
simulated annealing (SA) method, 347 
simulated m-estimator, 398-9 
simulation-based estimation methods, 364-418
motivating examples, 385-6 
see also MSL, MSM, indirect inference, simulators
simulators, 393-4, 406-10 
antithetic sampling, 408-9 
direct, 393 
frequency, 406 
GHK, 407-8 
Halton sequences, 409-10 
importance sampling, 407 
smooth, 407 
subsimulator, 394 
unbiased, 394, 400 
see also quadrature 
simultaneous equations model (SEM), 22-31, 213-4, 
219 
causal interpretation, 26 
error components, 762 
extension to nonlinear models, 27 
FIML estimator, 214
identification, 29-31, 213-4
LIML estimator, 214
nonlinear, 219
order condition, 213
rank condition, 214
reduced form, 25, 213
single-equation models, 31
structural form, 25, 213
structural model, 24
2SLS estimator, 214
3SLS estimator, 214
simultaneous equations probit, 523, 560-1 
simultaneous equations Tobit, 560-1 
single-index models, 123, 323, 325-7
definition, 123
identification, 325
marginal effects, 123
nonlinear panel model, 780
semiparametric estimators, 325-7
SIPP. See Survey of Income and Program Participation 
size of test, 246-7, 251-3
nominal size, 251
size-corrected test, 251
true size, 251-3
Sklar’s theorem, 652 
Slutsky’s Theorem, 945-6
alternative version, 949
small-sample bias. See finite-sample bias 
smooth maximum score estimator, 484 
smoothing parameters, 307 
smoothing spline estimator, 321 
social experiments, 32, 48-54
advantages, 50-2
examples, 51, 889
limitations, 52-4
randomization, 49-50
span, 320 
specific to general test, 285 
specification tests, 259-78
for clustered data, 840
for duration models, 628-32
for endogeneity, 275-6
for exogeneity, 277
for heteroskedasticity, 275
for individual-specific effects, 737
for omitted variables, 274
for overdispersion, 670-1
for pooling, 737
for unobserved heterogeneity, 628-32
for Tobit model, 543-4
see also m-tests; model diagnostics
spherical errors, 78 
split-sample IV estimator, 191-2 
SRS. See simple random sampling 
SSUs. See secondary sampling units 
stable family of distributions, 621 
stable unit treatment value assumption (SUTVA), 872 
standard errors. See robust standard errors
starting values, 340, 351
state dependence. See true state dependence
stated preference data, 498, 516
stationary population, 40
statistical packages, 349
step size adjustment, 338
stochastic order of magnitude, 954-5
stock sampling, 44, 626-7
strata, 41, 815
see also sampling schemes; weighting
stratification matching, 875-6, 893-6
stratified random sampling, 76-7, 814-5
use of Liapounov CLT, 951
use of Markov LLN, 948
see also sampling schemes; weighting
strict exogeneity. See strong exogeneity
strong consistency, 947
strong exogeneity, 22
in panel models, 700, 749-50, 752, 781
structural approach
to measurement error, 901
to weighting, 820-1
structural economic models, 28, 171
with selection, 558-60
structural form, 20, 25, 223
structural model, 20-31, 35-6
based on economic model, 28
exogeneity, 22-3
full information, 35
limited information, 35
reduced form, 21, 25, 223
structural form, 20, 25, 223
structure, 20
see also simultaneous equations model
structural selection models, 558-62
based on utility maximization, 558-60
endogenous regressors, 561-2
simultaneous equations Tobit, 560-1
studentized statistic, 359
subsampling method, 373
substitution bias, 53, 867
sufficient statistic, 732, 782, 799, 805
definition, 782
summation assumption, 748, 752
superpopulation, 40, 816
supersmoother, 321
SUR model. See seemingly unrelated regressions
survey methods, 41-2, 84-7, 814-8, 853-6
survey nonresponse, 45-6, 60, 739
see also attrition bias; imputation methods
Survey of Income and Program Participation (SIPP),
59
survival analysis. See duration models
survival function. See survivor function
survivor function
aggregate survivor function, 619
definition, 576-8
estimator in PH model, 596-7
Kaplan-Meier estimator, 581-2, 604-5
in mixture models, 615-6
multivariate, 649-50
parametric examples, 585
SUTVA. See stable unit treatment value assumption
switching regressions model. See Roy model
symmetrically censored least squares (SCLS)
estimator, 565
synthetic panels. See pseudo panels
systems of equations, 206-19
linear systems, 206-14
nonlinear systems, 214-9
seemingly unrelated regression, 209-10, 216
simultaneous equations model, 22-31, 213-4, 219
systems-robust standard errors, 208-9, 212, 219


target density, 444
tests. See hypothesis tests, m-tests, specification tests
three-stage least squares (3SLS) estimator, 214
3SLS estimator. See three-stage least squares
time series data
bootstrap, 381
NLS estimator, 158-9
Newey-West standard errors, 137, 175, 727
time-varying regressors
in duration models, 597-9
in panel data models, 702, 749-51
Tobit model, 536-44
Bayesian methods, 563
censored mean, 538-41
censoring mechanism, 532, 579
consistency of MLE, 538
definition, 536
example, 530-1
generalized, 548
Heckman two-step estimator, 543, 567-8
identification, 536
as imputation method, 932
inverse-Mills ratio, 540-1
marginal effects, 541-2
measurement error in dependent variable, 914
ML estimator, 537-8
NLS estimator, 542
OLS estimator, 543
panel data, 800-1
simultaneous equations, 560-1
specification tests, 543-4
with stochastic thresholds, 547
with truncated data, 538
truncated mean, 538-41, 566-7
two-limit, 536
type 2, 547
type 5, 557
see also selection models
top-coded data, 532-3, 541, 563
transformation methods, 413
transformation theorem, 949
transformed ML estimator, 766
transition data. See duration models
trapezoidal rule, 388 
treatment-control comparison 
application, 890-1 
treatment effects framework, 862-5, 871-8, 889-96 
balancing condition, 864, 893-4 
binary treatment variable, 862 
conditional independence assumption, 863, 865 
conditional mean independence assumption, 864 
heterogeneous treatment effects, 882, 885 
multiple treatments, 860 
overlap assumption, 864, 871 
propensity score, 864-5 
Roy model, 867 
stable unit treatment value assumption, 872 
see also treatment evaluation 
treatment evaluation, 860-98 
application, 889-96 
IV estimators, 883-9 
matching estimators, 871-8 
DID estimators, 878-9 
selection bias, 865-71 
selection on observables, 862-4, 878-83, 889-96
selection on unobservables, 865-71, 883-9 
regression discontinuity design, 879-83 
see also treatment effects framework 
treatment group, 49, 862 
trimming, 316, 333 
trivariate reduction, 686 
true state dependence 
duration models, 612, 630, 636 
dynamic panel models, 763-4, 798, 802 
see also unobserved heterogeneity 
truncated models, 530-44 
conditional mean, 535 
count models, 679-80 
definition, 532 
examples, 530-1, 535 
ML estimator, 534 
see also Tobit model; selection models 
truncated moments of standard normal, 540, 566-7 
truncation mechanisms, 532 
truncation from above, 532 
truncation from below, 532 
2SLS estimator. See two-stage least squares 
two-limit Tobit model, 536 
two-part model, 544-6 
application, 553-5 
compared to selection models, 546, 552-3 
definition, 545 
example, 545-6 
see also hurdle model 
two-stage IV estimator, 187 
two-stage least squares (2SLS) estimator, 101-2, 
187-91 
alternatives to, 190-2 
Basmann’s approach, 190-1 
compared to optimal GMM, 187-8
as GLS in transformed model, 188-9
as GMM estimator, 187
nonlinear, 195-6, 199
panel data, 746, 755
in SEM, 214
Theil’s interpretation, 189-90
two-stage sampling, 41, 818
two-step estimators
GMM, 176, 187
Heckman, 543, 550-1, 556, 567-8
sequential m-estimator, 200-2
two-step GMM estimator, 176, 187
panel, 746, 755
two-way effects model, 738
type I error, 246-7
type II error, 246-7
type 1 extreme value distribution, 477, 486-7
duration model error, 590
multinomial logit model, 505
type 2 Tobit. See bivariate sample selection model
type 5 Tobit. See Roy model


ultimate sampling units (USUs), 41, 815
unbalanced panels, 739
uncentered explained sum of squares (ESS), 241
uncentered R-squared, 241, 263
unconfoundedness assumption. See conditional
independence assumption
underrecording, 915
undersmoothing, 305, 333, 380
uniform convergence in probability, 126, 301
uniform number generators, 412
uniformly most powerful (UMP) test, 247
unit roots, 382, 767-8
universal logit model, 500
unobserved heterogeneity
application, 632-6
in competing risks model, 647
in count models, 675-7, 686
distributions for, 614-5, 620-1
in duration models, 611-25
finite mixture models for, 621-5
identification, 618-20
IM test for, 267
individual-specific effects, 700, 764
mixture models for, 613-21
MSL example, 397-8
MSM example, 403
multiplicative, 613, 686
in nonlinear systems, 215
specification tests for, 629-32
variance inflation, 614
versus true state dependence, 612, 630, 636, 763-4,
798, 802
USUs. See ultimate sampling units


validation sample, 911 
variance components, 735, 845
variance matrix estimation
BHHH estimate, 138
degrees-of-freedom adjustment, 75, 102, 138,
185-6, 278, 841
expected Hessian estimate, 138
for extremum estimator, 137-9
for GMM estimator, 174-5
Hessian estimate, 138
for NLS estimator, 154-5
OPG estimate, 138
robust estimate, 137
sandwich estimate, 137, 144
for weighted estimators, 854-6
see also robust standard errors
variance reduction for simulation, 478 


Wald estimator 
in treatment effects models, 886 
Wald test, 136-7, 224-33 
asymptotic distribution, 226-8 
comparison with LM and LR tests, 238-9
definition, 136 
examples, 236, 241-3 
exclusion restrictions, 227 
F-test version, 226 
introduction, 136-7 
lack of invariance, 232-3 
likelihood based, 234, 241-3 
linear models, 224-5
linear restrictions, 136-7 
in misspecified models, 229-30 
nonlinear restrictions, 224, 229 
power, 248-50 
of statistical significance, 228 
t-test version, 226-8 
see also hypothesis tests 
weak consistency, 947 
weak exogeneity, 22 
in panel data, 749, 752, 758 
weak instruments, 100, 104-12 
application, 110-2 
definition, 104 
finite sample bias, 108-12, 177-8, 191-2, 196 
GMM estimator, 177-8 
inconsistency, 105-7 
indicators, 104-5, 756
panel data, 751-2, 756 
Weibull distribution, 584-6 
Weibull-gamma regression model, 615 
Weibull regression model, 143-4, 589, 606-8, 635 
weighted estimation 
endogenous stratification, 828-9 
exogenous stratification, 818-20
weighted exogenous sampling ML (WESML)
estimator, 828
weighted least squares (WLS) estimator, 81-5 
asymptotic distribution, 83 
contrasted with GLS, 83 
definition, 83 
example, 84-5 
in pooled model, 702-3, 721 
see also FGLS estimator 
weighted maximum likelihood (WML) estimator, 
828 
weighted semiparametric least squares (WSLS)
estimator, 327 
for binary outcome models, 485 
weighting, 817-21, 827-9, 853-6 
descriptive versus structural approach, 820 
with endogenous stratification, 827-9 
sample weights, 817-8 
variance estimation, 853-6 
weighted prediction, 821 
weighted regression, 818-20 
whether to weight, 820-1 
welfare analysis 
with ARUM, 506-7 
with nested logit model, 512 
WESML estimator. See weighted exogenous sampling 
ML 
White standard errors. See robust standard errors 
wild bootstrap, 377-8 
window width, 299, 307, 312 
Wishart distribution, 443 
see also inverse-Wishart distribution 
within estimator. See fixed effects estimator 
within model. See fixed effects model 
within-group variation, 709, 733 
with-zeros model, 681 
WLS estimator. See weighted least squares 
WML estimator. See weighted maximum likelihood 
WNLS estimator, 156-7 
asymptotic distribution, 156 
definition, 156 
example, 159-63 
as GLM, 158 
working matrix 
definition, 82 
for GLM estimator, 158 
for pooled GEE estimator, 794 
for pooled WLS estimator, 721 
for WLS estimator, 82-3 
WSLS estimator. See weighted semiparametric least 
squares 


zero-inflated count model, 680-1